Wikidata Case Report: Ingesting Production Databases of the Performing Arts

30 June 2018

Authors:

  • Julia Beck, Specialised Information Service Performing Arts, Frankfurt
  • Beat Estermann, OpenGLAM CH, Bern
  • Birk Weiberg, Swiss Archive of the Performing Arts, Zürich / Bern / Lausanne

Introduction

Production databases are at the core of performing arts databases. They contain event-related information about who performed which works, written by whom and with contributions by whom, where and when. In some institutions, such as the Swiss Theatre Collection, Operabase, or Carnegie Hall, such databases have been assembled explicitly, aiming for full coverage of all productions within a given scope (time, place, genre, agent), and may serve as the main finding aid of archival collections (as is, for example, the case at the Swiss Theatre Collection). Other theatre archives have traditionally pursued an object-focused approach in their collection databases, which typically means that event-related information can be found in various metadata fields related to concrete objects. This information is usually only partly standardized, and individual organizations mostly do not aim for full coverage of all productions within a given scope, but focus on the productions for which they have objects in their collection. The Specialised Information Service Performing Arts in Frankfurt, which is assembling data from many different theatre archives in its online database, is presently in the process of extracting such event-related data from the various data sources in order to present it in a more coherent fashion and to make it searchable on its platform.

The original sources from which production databases are compiled comprise theatre yearbooks, annual theatre programmes, playbills related to specific productions, and possibly further archival documents. Thus, where full coverage of all productions within a given scope has not been achieved in the past for certain periods, venues, or agents, it could often still be attained by systematically compiling the information preserved in archives – either by heritage institutions themselves or by their users (e.g. researchers, encyclopædists, etc.). Furthermore, the compilation of data about current events could easily be organized through crowdsourcing by theatre enthusiasts or by theatre companies themselves, as the original data sources are widely available and often end up in the private collections of theatre enthusiasts.

Pilot Ingest of the Production Database of Schauspielhaus Zürich 1938-1968

In order to demonstrate the possibility of using Wikidata as a platform for the creation of an international production database for the performing arts, a pilot ingest of an existing production database was carried out in 2017/2018. For the pilot ingest, the datafile “Repertoire des Schauspielhauses Zürich, 1938-1968” was used, which contains data about some 700 (spoken) theatre productions staged at Schauspielhaus Zürich between 1938 and 1968. Schauspielhaus Zürich is one of the most prominent and important theatres in the German-speaking world and played a particularly important role as a German exile theatre during the Nazi period. The dataset had been made available by the Municipal Archives of the City of Zürich on the occasion of the Zürich Archival Hackday 2017, and work on the pilot ingest was further pursued during the Swiss Open Cultural Data Hackathon 2017 in Lausanne. Compared to the Swiss Theatre Collection’s production database, the dataset is small and limited in scope (spoken theatre only), but of particularly good quality (almost no prior data cleansing was needed).

Procedure of the Pilot Ingest

To carry out the pilot ingest, we proceeded as follows:

  • Data fields in the original datafile were mapped to existing ontology structures in Wikidata, and, where necessary, additional classes and properties were defined.
  • Wherever possible, the data fields in the original dataset were complemented with data fields containing the respective Wikidata Q-number (reconciliation). For persons for whom no entry existed in Wikidata, new entries were created on the basis of the Schauspielhaus dataset. Some new data fields were created in the original dataset to make implicit information explicit in view of the data ingest into Wikidata (e.g. language of the play, genre, or production company).

These two steps were carried out in parallel, with only a few interactions between them serving to ensure that all the data fields needed for the ingest would be present in the source dataset in the required form.

The third step consisted in writing a PHP script that draws both on the enhanced source dataset and on the mapping information to produce code that could then be fed into the Quick Statements Tool.
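
The script itself was written in PHP and is specific to the source file; the following minimal sketch in Python illustrates the general idea under assumed, simplified column names ("title", "director_qid", etc.) and an illustrative subset of the property mapping. The item used as the production class is left as a placeholder.

    import csv

    # Illustrative mapping of hypothetical source columns (containing reconciled
    # Q-numbers) to Wikidata properties; the real mapping table is documented on
    # the ingest wiki page and covers more fields.
    FIELD_TO_PROPERTY = {
        "director_qid": "P57",    # director
        "based_on_qid": "P144",   # based on (the underlying literary work)
        "company_qid": "P272",    # production company
        "venue_qid": "P276",      # location
    }

    def to_quickstatements(row):
        """Emit Quick Statements V1 commands (tab-separated) for one production."""
        lines = ["CREATE"]
        lines.append('LAST\tLde\t"{}"'.format(row["title"]))   # German label
        lines.append("LAST\tP31\tQ_PRODUCTION_CLASS")          # placeholder for the production class item
        for column, prop in FIELD_TO_PROPERTY.items():
            if row.get(column):
                lines.append("LAST\t{}\t{}".format(prop, row[column]))
        if row.get("premiere_date"):                            # e.g. "1938-09-10"
            # P1191 (date of first performance); the property actually used may differ
            lines.append("LAST\tP1191\t+{}T00:00:00Z/11".format(row["premiere_date"]))
        return "\n".join(lines)

    with open("schauspielhaus_enhanced.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            print(to_quickstatements(row))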

The fourth step consisted in feeding the data into Wikidata – which took about 24 hours and represented a rather small effort compared to the other tasks – and in carrying out a quality check, followed by some corrections.

In the following sections, the various steps are described in more detail, relating the approaches taken and the challenges encountered.

Ontology Development and Mapping

For ontology development and mapping, we were able to draw on our earlier work on modelling the data held by the Swiss Theatre Collection (among them data about 55'000+ professional performing arts productions) and the Swiss Dance Collection, which had resulted in an integrated data model (Data Model for the Swiss Performing Arts Platform, Draft Version 0.51) that largely draws on existing data models in the field.

Thus, in a first step, the source dataset was matched against the Data Model for the Swiss Performing Arts Platform, and in a second step, corresponding classes and properties were identified and, where necessary, newly created on Wikidata. Both the data modelling issues and the resulting mapping table were documented on a wiki page specific to the ingest of production databases from Switzerland; the existing property table for theatrical productions was completed where necessary. The pilot ingest was also used as an occasion to document and complement typologies for genres of the performing arts and performance types.

Before our ingest, Wikidata contained only about 125 performing arts productions and about 100 cast members related to performing arts productions. There were no series of performances (used in the case of guest performances), no theatre seasons, and no individual stages. Modelling data from a rather new field in Wikidata comes with difficulties, as it is not easy to find existing data structures and to identify established modelling practices. We used existing WikiProjects with their property lists as well as existing Wikipedia entries as starting points for our searches and made extensive use of the property search function on the Wikidata platform (advanced search functionality).

Once we had gained a first overview, creating new classes was quite easy and straightforward. Creating new properties is, however, a completely different matter, as property proposals need to be discussed with the community and well argued in order to achieve community consensus. To do so, we proceeded as follows: First, we did a full analysis of the data modelling issues related to the pilot ingest and documented them on the wiki page mentioned above, along with proposals for the creation of new classes and properties. We then drew the community’s attention to our documentation page by posting messages to the Project Chat, to relevant WikiProjects on Wikidata, and to the Wikidata + GLAM Facebook page. We received some feedback, and a few exchanges with fellow Wikidataists took place. We then moved on to create the new classes and submitted property proposals for the new properties. We took an active role in the ensuing discussions and again sent messages to the chat and the Facebook page to draw people’s attention to the ongoing property discussions. The discussions were held in a positive spirit, and a few weeks later the needed properties had either been created or we had, together with the community, found a viable alternative solution.

Entity Reconciliation

For entity reconciliation, OpenRefine was used. A prerequisite for reconciling entities in OpenRefine is that they are separated into individual cells; fields with comma-separated lists of persons were thus split into several columns. The reconciliation of entities against Wikidata itself is a well-documented process that is performed column-wise (see screenshots). In the case of persons, this works quite well, as the matches can be restricted to an explicit type (Q5, human). Attempts to identify literary references, on the other hand, proved to be difficult, not only due to missing entries and the lack of translated labels in Wikidata, but also because existing entries were classified inconsistently, which made it difficult to reconcile against a specific type.
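
The same reconciliation service that OpenRefine talks to can also be queried directly. The following is a minimal sketch, assuming the public Wikidata reconciliation endpoint at wikidata.reconci.link (the URL has changed over time) and the standard reconciliation API request/response format; candidate matches are restricted to type Q5 (human).

    import json
    import requests  # third-party library

    # Assumed endpoint of the Wikidata reconciliation service used by OpenRefine.
    RECON_ENDPOINT = "https://wikidata.reconci.link/en/api"

    def reconcile_person(name):
        """Return candidate Wikidata matches for a person name, restricted to Q5 (human)."""
        queries = {"q0": {"query": name, "type": "Q5", "limit": 3}}
        response = requests.post(RECON_ENDPOINT, data={"queries": json.dumps(queries)})
        response.raise_for_status()
        return response.json()["q0"]["result"]

    # Example: an actress from the Schauspielhaus Zürich ensemble of the period.
    for candidate in reconcile_person("Therese Giehse"):
        print(candidate["id"], candidate["name"], candidate.get("score"), candidate.get("match"))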

All unmatched person names were exported, grouped by function (director, actor/actress, scenographer, etc.), and Quick Statements code was generated with small custom scripts to add them to Wikidata (see the sketch below). Subsequently, they could be reconciled in OpenRefine as described above. In the case of the works that the performing arts productions are based on, the creation of new entries was not as straightforward: theatre productions are usually labelled in the language of the performance; in the case of Schauspielhaus Zürich, most labels are in German, regardless of the original language of the work. As a result, there was no fool-proof way to check whether a work was actually missing from Wikidata or whether the item existed but did not have a German label (yet). In order to avoid creating many duplicate items, we refrained from adding a based on (P144) statement where reconciliation against existing work entries was not possible.
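
The custom scripts are not reproduced here; the following sketch shows the kind of Quick Statements (V1) code they would generate for a missing person, with an assumed helper taking the name and an optional occupation item (Q33999, actor, is used as an example).

    def create_person_commands(name, occupation_qid=None):
        """Quick Statements V1 commands (tab-separated) to create a new person item."""
        lines = [
            "CREATE",
            'LAST\tLde\t"{}"'.format(name),  # label in German, as given in the source file
            "LAST\tP31\tQ5",                 # instance of: human
        ]
        if occupation_qid:
            lines.append("LAST\tP106\t{}".format(occupation_qid))  # occupation, e.g. Q33999 (actor)
        # A source could be appended to each statement line using S-prefixed
        # properties (e.g. S248, stated in), which helps prevent the new item
        # from being treated as an unsourced orphan before the main ingest.
        return "\n".join(lines)

    # Hypothetical example call with a placeholder name:
    print(create_person_commands("Maria Muster", occupation_qid="Q33999"))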

Data Ingest and Quality Check

Preparing the data can take time and thus delay ingests, especially if it also involves developing the data model and applying for new properties. Missing entities (e.g. for persons or organizations) need to be newly created in Wikidata, and the corresponding Q-numbers need to be written into the source file, along with the Q-numbers of already existing entities. Such partial ingests, however, may be confusing for other Wikidataists, and we used a small Python script to validate that entities created earlier had not been deleted in the meantime (a sketch of this check is given below). Ingesting the data with the Quick Statements Tool was straightforward: a PHP script was used to transform the CSV data of the source file into Quick Statements code. The main data ingest was done with Quick Statements Version 1, as it seems to come with better error reporting. Error messages were copied to a Word document for later follow-up, along with other issues encountered during the ingestion process. Before ingesting the bulk of the data, the code was tested on a couple of items and the script was adapted accordingly. The theatre productions for which data was ingested in the context of the pilot ingest can be retrieved by means of a SPARQL query.
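
The validation script itself is not reproduced here; a minimal sketch of such a check, using the wbgetentities module of the Wikidata API (which accepts up to 50 ids per request and marks non-existing items as missing), might look as follows.

    import requests  # third-party library

    WIKIDATA_API = "https://www.wikidata.org/w/api.php"

    def find_deleted(qids):
        """Return those Q-numbers from qids that no longer exist on Wikidata."""
        deleted = []
        for start in range(0, len(qids), 50):          # wbgetentities accepts up to 50 ids per call
            batch = qids[start:start + 50]
            params = {"action": "wbgetentities", "ids": "|".join(batch),
                      "props": "info", "format": "json"}
            entities = requests.get(WIKIDATA_API, params=params).json()["entities"]
            deleted += [qid for qid, entity in entities.items() if "missing" in entity]
        return deleted

    # Example with placeholder Q-numbers; in our case, the list was read from the source file.
    print(find_deleted(["Q42", "Q64"]))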

The following errors were encountered at the time of data ingestion:

  • Two Q-numbers for persons were not found, as the corresponding items had been deleted in the meantime (probably due to the creation of duplicate items). These errors could easily be corrected manually.
  • One entry already existed; probably a test case that we had overlooked. The two items were merged manually.
  • Six entries in the source database turned out to be problematic as they contained data about two plays that were given on the same night; these entries resulted in errors in the code produced by the script. These errors had to be followed up separately, as additional data cleansing and data modelling was required to handle these special cases. This issue, along with some minor errors in the source database, has been reported to the maintainer of the dataset.
  • One line of code was not properly processed by the API (“Bad API response”). The omitted statement was added manually.

In addition, the following issues were identified when inspecting the resulting items:

  • One typo in the source dataset was spotted by chance. It was corrected and was reported to the maintainer of the dataset.
  • In the field for “production company”, the source dataset mistakenly had Schauspielhaus Zürich (Q675022), the theatre building in Zurich, Switzerland, instead of Schauspielhaus Zürich (Q40313234), the theatre production company in Zurich. The error was corrected by extracting all the problematic items via the Wikidata SPARQL endpoint and by using Quick Statements Version 2 to delete the erroneous statements and to add the correct ones (see the sketch after this list).
  • Two data fields were omitted in the script. To remedy this, the Q-numbers of the newly created production items were introduced into the source dataset using OpenRefine (cf. the instructions on how to do this), the script was corrected, and the additional code was again fed into the Quick Statements Tool.
  • We also noticed that we had run into a bug in Quick Statements Version 1, which had the effect that data about actors who played several roles in the same play was ingested for only one of their roles. We identified the problematic statements in the source file and re-ingested them using Quick Statements Version 2.
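
As referenced in the list above, the correction of the production company statements can be scripted. The following sketch, assuming that P272 (production company) was the property used, retrieves the affected items from the SPARQL endpoint and prints Quick Statements commands; the "-" prefix is the Quick Statements syntax for removing a statement.

    import requests  # third-party library

    SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

    # Find all items that state the theatre building (Q675022) as production company
    # instead of the production company item (Q40313234); P272 is assumed here.
    QUERY = """
    SELECT ?item WHERE {
      ?item wdt:P272 wd:Q675022 .
    }
    """

    response = requests.get(SPARQL_ENDPOINT,
                            params={"query": QUERY, "format": "json"},
                            headers={"User-Agent": "pilot-ingest-cleanup-sketch/0.1"})
    for binding in response.json()["results"]["bindings"]:
        qid = binding["item"]["value"].rsplit("/", 1)[-1]   # strip the entity URI prefix
        print("-{}\tP272\tQ675022".format(qid))             # remove the erroneous statement
        print("{}\tP272\tQ40313234".format(qid))            # add the correct statement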

Best Practices / Recommendations

Based on the experiences gathered throughout the pilot ingest, we can make the following recommendations:

  • Always keep track of your work steps, both for yourself (as the data ingestion process may take longer than expected, and in case of errors, you may need to repeat some of the steps) and for posterity, as other people may want to ingest similar datasets.
  • Document your reflections regarding the various data modelling issues encountered. This is both useful in the context of property discussions and in view of future ingests or data cleansing activities on Wikidata.
  • Always provide a source when creating new items (e.g. for works or persons) that you will need as objects in statements contained in your main dataset. In case of delays with the data ingest, this may prevent some of the newly created items from being deleted by fellow Wikidataists who would otherwise rightly consider them orphaned items.
  • When creating new items about persons, make sure that you also enter some data on their occupation, date of birth/death, etc., as there are potentially many people with the same name.

Remaining Challenges and Outlook

Several challenges that were observed during the pilot ingest remain to be solved. To conclude our report, we list the main ones, which we intend to address over the coming months and years.

Ontology Development and Documentation

  • In general, the documentation of ontologies and widely accepted modelling practices on Wikidata needs improvement. Existing data entries need to be inspected and corrected where necessary. Modelling rules need to be documented, and constraints should be defined accordingly in order to draw users’ attention to problematic entries. Providing modelling examples in the form of graphics would be helpful. Documentation of data modelling practice also facilitates correct querying of the data; at present, special cases may be missed in queries due to poor documentation.
  • In order to facilitate participation by users from various countries, multilingual descriptions and definitions of key classes and properties should be provided.
  • The ontology should be extended to also be able to describe collections and archival structures and eventually to provide pointers to archival material related to specific performing arts productions.
  • Some data modelling details related to the rendering of FRBR group 1 classes and the use of properties related to them should be reviewed (see our respective comments on the data modelling issues).
  • Eventually, the respective documentation of WikiProject Theatre and WikiProject Performing Arts should be merged.

Data Ingestion

  • In the course of the pilot ingest, we have not been able to link all the productions to the (literary) works they are based on, mainly because the corresponding entries were missing on Wikidata and because the source data file did not contain sufficient information to create new items. In order to properly render the works and their various translations and/or adaptations, we would need to be able to draw on reliable data sources which properly distinguish between original works and their expressions/manifestations. Similarly, when it comes to linking performers to their character roles, it would be preferable to first ingest the character roles related to a work on the basis of the literary source rather than on the basis of production databases, which contain the character roles present in a given production – and these may or may not correspond to the character roles present in the literary source. Note that the names of character roles (and even their proper names) may differ between translations or adaptations of a play.
  • Further bulk data ingests of production databases should be encouraged. As the item statistics show, at the time of writing, most of the Wikidata entries for performing arts productions and their cast members, as well as all the entries for theatre seasons and individual stages, are the result of our pilot ingest.
  • Performing arts related data could also be imported from Wikipedia, where there are various instances of lists of theatrical works, lists of performing arts productions, etc. By importing these lists into Wikidata, this information would become available for use across all language versions of Wikipedia.
  • If we want to encourage data entry by volunteers on a production-by-production basis, e.g. on the basis of individual theatre programmes, we would need to offer customizable data-entry forms that can be adapted to the typical data structures present in such programmes. Manual data entry on Wikidata is presently rather tedious.

Data Use

  • In the longer term, performing arts related data from Wikidata should be used within Wikipedia (in templates or in lists). There are presently many instances of such templates and lists on Wikipedia, but data is usually not ingested into and pulled from Wikidata. By systematically maintaining such data on Wikidata, it would become easier to re-use it across many different language versions of Wikipedia.
  • We should also develop and document further use cases of the resulting performing arts database (e.g. in the fields of research, education, and training, but also from the perspective of performing arts professionals themselves and from the perspective of theatre archives).