Tuesday, 19 March 2013

Crowd Sourcing and Digital Editing

The term is almost over, and finally I find some time to write down some of the thoughts I have been chewing over in the past months. Apologies to my numerous followers for the long silence.
The topic this time is crowd-sourcing, which is a bit unusual for me as I have not been directly involved in any crowd-sourcing project, but as many of you already know, I'm working on a book on Digital Scholarly Editing, which inevitably forces me to consider new forms of edition and, among them, crowd-sourcing and its role in editing.
A King's College London project devoted to the classification of different types of crowd-sourcing activity has just concluded, producing a hefty report written by Stuart Dunn. By its author's own admission, however, the classification contained in the report is a bit too comprehensive to be really useful for my purpose, so here you have the one I have created (yes, all by myself!). Comments are welcome!

Without any pretension of being exhaustive, crowdsourcing that in some way concerns the editing and publication of texts can be classified according to five parameters:
1.     Context: Crowdsourcing projects can be hosted and supported by: 
a.     Universities and Cultural Heritage Institutions, such as Libraries and Museums. This is the case of some of the projects mentioned above (Transcribe Bentham is hosted and supported by UCL, for instance), and of the National Library of Australia’s Historic Newspaper Digitisation Project, where users have been asked to correct OCRed articles from historical newspapers.
b.     Non-governmental organisations and other private initiatives: this is the case, for instance, of Project Gutenberg, which began in 1971 from the vision of its founder, Michael Hart, and has continued ever since thanks to donations.
c.     Commercial: this is the case, for instance, of Google, which uses the ReCAPTCHA service to ask users to enter words seen in distorted text images onscreen. Part of these words comes from unreadable passages of digitised books, so users help correct the output of the OCR process while the service protects websites from attacks by internet robots (the so-called ‘bots’).
2.     Participants: or better, how they are recruited and which skills they should possess to be allowed to contribute. Some projects issue open calls, in which anybody can enrol and contribute as they wish, with no particular skill being required other than commitment; other projects require their contributors to possess specific skills, which are checked before the user is allowed to do anything. The former is the case for the Historic Newspaper Digitisation Project or Project Gutenberg, the latter for the Early English Laws project. Many projects place themselves in between these two categories, closer to one end or the other. In the SOL project, for instance, users are assumed to read and understand Greek, and their competence is verified by the quality of their translations; to register as editors, however, users are expected to declare their competences, which are checked by the editorial board.
3.     Tasks: The tasks requested of users could be one or more of:
a.     Transcribing manuscripts or other primary sources, as in the case of Transcribe Bentham.
b.     Translating: as in the case of SOL.
c.     Editing, which is requested by the Early English Laws project.
d.     Commenting and Annotating: as in the case of the Pynchon Wiki.
e.     Correcting: this is the case, for instance, of the National Library of Australia’s project seen above and of Project Gutenberg, where users not only contribute by uploading new material, but also take on the proofreading of texts in the archive.
f.      Answering specific questions: this is the case for the Friedberg Genizah Project, which uses the project's Facebook page to ask its followers specific questions about, for instance, a particular reading of a passage, or whether the hand of two different fragments is the same, and so on.
4.     Quality control: the quality of the work produced by the contributors can be assessed by professional staff hired for that purpose (e.g. Transcribe Bentham), or could be assured by the community itself, with super-contributors whose controlling roles are earned in the field by becoming major contributors (e.g. Wikipedia), or granted because of their qualifications (e.g. SOL), or both.
5.     Role in the project: for some projects the crowdsourced material can be the final aim of the project, as for Project Gutenberg or the Historic Newspaper Digitisation Project, or it could be a product that will be used in other stages of the project. The transcriptions produced within the Transcribe Bentham project serve a double purpose: they represent the main outcome of the project as, once their quality has been ascertained, they feed into UCL’s digital repository, but they are also meant to be used for the edition of The Collected Works of Jeremy Bentham, in preparation since 1958.
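
For the more computationally minded, the five parameters above could be recorded as a small data structure, so that projects can be compared or filtered by parameter. What follows is only a rough sketch in Python: all the class, field and value names are my own invention for illustration, and a field is left empty (None or an empty set) wherever the classification above does not say anything about a given project.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, List, Optional


class Context(Enum):
    INSTITUTIONAL = "university or cultural heritage institution"
    NON_PROFIT = "non-governmental organisation or private initiative"
    COMMERCIAL = "commercial"


class Participation(Enum):
    OPEN_CALL = "anybody can enrol"
    VETTED = "skills checked beforehand"
    IN_BETWEEN = "between the two"


class Task(Enum):
    TRANSCRIBING = "transcribing"
    TRANSLATING = "translating"
    EDITING = "editing"
    ANNOTATING = "commenting and annotating"
    CORRECTING = "correcting"
    ANSWERING = "answering specific questions"


class QualityControl(Enum):
    PROFESSIONAL_STAFF = "professional staff"
    COMMUNITY_BY_CONTRIBUTION = "super-contributors (by contribution)"
    COMMUNITY_BY_QUALIFICATION = "super-contributors (by qualification)"


class Role(Enum):
    FINAL_PRODUCT = "final aim of the project"
    INTERMEDIATE = "used in other stages of the project"


@dataclass(frozen=True)
class Project:
    """One crowdsourcing project described by the five parameters."""
    name: str
    context: Optional[Context]
    participation: Optional[Participation]  # None = not stated above
    tasks: FrozenSet[Task]
    quality_control: FrozenSet[QualityControl]
    roles: FrozenSet[Role]


# Values taken only from the descriptions given in the list above.
BENTHAM = Project(
    name="Transcribe Bentham",
    context=Context.INSTITUTIONAL,
    participation=None,
    tasks=frozenset({Task.TRANSCRIBING}),
    quality_control=frozenset({QualityControl.PROFESSIONAL_STAFF}),
    roles=frozenset({Role.FINAL_PRODUCT, Role.INTERMEDIATE}),
)

GUTENBERG = Project(
    name="Project Gutenberg",
    context=Context.NON_PROFIT,
    participation=Participation.OPEN_CALL,
    tasks=frozenset({Task.CORRECTING}),
    quality_control=frozenset(),
    roles=frozenset({Role.FINAL_PRODUCT}),
)

PROJECTS = [BENTHAM, GUTENBERG]


def with_task(projects: List[Project], task: Task) -> List[str]:
    """Return the names of the projects that ask users for the given task."""
    return [p.name for p in projects if task in p.tasks]
```

Calling with_task(PROJECTS, Task.TRANSCRIBING) would then return only Transcribe Bentham. Using sets for tasks, quality control and role reflects the fact that a project can sit in more than one category at once, as Transcribe Bentham does for its role.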

Is there anything else I should have included?

1 comment:

  1. Hi, Elena.

    Regarding #2, there is an active discussion going on in the crowdsourcing world about the role of training and testing as a disincentive to participation. Probably the strongest advocate for open participation is Chris Lintott, who I recall pointing out at an AHA2012 panel that self-evaluation (i.e. confidence) was never correlated to the quality of a participant's results, and that it's entirely possible to evaluate the quality of people's participation after they contribute and weight their contributions accordingly instead of attempting an evaluation before you allow them to contribute.

    Regarding #3, I'd like to draw your attention to the Harry Ransom Center's Manuscript Fragments project, which asks volunteers to identify medieval fragments used as binding in later books. (A good overview of the project is Micah Erwin's St. Louis presentation.) This sort of project isn't quite a simple question--maybe it's more reminiscent of the old tagging crowdsourcing projects of a half-decade ago than it is of transcription tools--but the identification of the texts and the scripts that volunteers have made are consistent, and usually high quality from what I understand.

    Regarding #4, you might be interested in my old classification of quality control strategies in crowdsourced transcription projects.

    Regarding #5, a lot of memory institutions are using crowdsourcing for patron engagement -- as a way to enhance the public's experience of the material and to convert volunteers into advocates for the collection, the institution, and the discipline. Trevor Owens wrote about this last year, and I suspect you'll hear Paul Flemons address this at SDSE this July.