Wikidata:WikiProject Cultural heritage/Reports/Ingesting Swiss heritage institutions

Author: Beat Estermann (20 November 2016)

Introduction to the Case Report

The present report describes the process of ingesting data about Swiss heritage institutions into Wikidata as it was carried out in fall 2016. Its purpose is to document the experiences and insights gathered during the process and to collect the reflections it triggered.

The first part consists of a summary report written upon conclusion of the process, while the subsequent parts contain the notes taken during the different stages of the ingestion process itself.

Summary Report

Background

For the purpose of the OpenGLAM Benchmark Survey I compiled various lists of heritage institutions in Switzerland into one comprehensive inventory, which has also been available as open data on datahub.io and on a project page on German Wikipedia since 2014. An overview of the pre-existing lists that were used as sources for the inventory can be found here. Since its creation back in 2012 I have been maintaining the Swiss inventory to the best of my knowledge. There has however been little re-use of the data so far; as a consequence, I have little insight into how its quality evolves over time, as some information may become outdated quite rapidly. Similar lists exist for the other countries covered by the OpenGLAM Benchmark Survey, such as Brazil, Bulgaria, Finland, New Zealand, Poland, Portugal, The Netherlands, Russia, Spain, Sweden, and Ukraine.

Scope of phase 1 of the ingestion process

The goal of phase 1 of the ingestion process was to create a Wikidata item for every heritage institution in Switzerland and Liechtenstein with a basic set of properties that are sourced to the Swiss GLAM Inventory. The properties covered were: name of the institution (label in the original language for all the items, and in German for most of the items); description (in German and English for most of the items; where the description was not obvious, it was left out to be treated later); institution type, including the information whether it is a memory institution or not; street address (physical location); zip code; municipality; canton; country; URL of the website; and, where applicable, PCP number and PCP category (for institutions with collections listed in the Swiss inventory of cultural properties of national or regional significance).

Procedure followed

The following step-by-step procedure served as a guideline. It was primarily inspired by existing accounts of data ingestion on Wikidata and by earlier experiences with linked data publication. The original descriptions of the steps were:

  • Step 1: Get a thorough understanding of the data (class diagram: what are the classes? what are the properties? what are the underlying definitions?)
  • Step 2: Analyze to what extent Wikidata already has the required data structures (relevant classes and properties already defined?). If necessary, create the relevant classes and properties (community discussion required). Create a mapping between the source data and the target data structure on Wikidata.
  • Step 3: Check whether data formats in the source data correspond to acceptable formats on Wikidata. If not, some data cleansing may be required.
  • Step 4: Check whether there is an explicit or implicit unique identifier for the data entries. – If not, think about creating one (how best to go about this needs clarification).
  • Step 5: Check whether a part of the data already exists on Wikidata. If yes, create a mapping in order to add data to already existing items and to avoid creating double entries in Wikidata (see the query sketch after this list).
  • Step 6: Model the data source in Wikidata (how best to go about this needs clarification).
  • Step 7: Clean up existing data on Wikidata.
  • Step 8: Ingest the data, providing the source for each statement added (avoid overwriting existing data). Log conflicting values. At the beginning, proceed slowly, step by step; check which aspects of the upload process can be automated using existing tools and which aspects need to be dealt with manually. Ask community members for a review of the first items ingested before batch ingesting.
  • Step 9: Visualize the data using the Listeria tool or SPARQL queries in order to inspect the data (quality check).
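
As a minimal illustration of Step 5, here is a sketch of the kind of SPARQL query used to find already existing items (in this example: items typed as museums, or a subclass thereof, located in Switzerland); the full queries actually used during the process are documented in the field notes below:

    SELECT ?item ?itemLabel WHERE {
      ?item wdt:P31/wdt:P279* wd:Q33506 ;   # instance of museum (Q33506) or a subclass thereof
            wdt:P17 wd:Q39 .                # country: Switzerland (Q39)
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de". }
    }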

This procedure proved quite reasonable, although I eventually ended up looping through some of the steps in an iterative manner. Also, the steps involving the writing of SPARQL queries are somewhat redundant; it makes sense to document all the queries in one place and to modify existing queries on the fly as required by subsequent steps.

Tools

During the ingestion process I used the following tools:

  • Microsoft Excel / Open Office Calc: I have been maintaining the original GLAM inventory in Excel; that’s also where I made most of the additions to the data. Open Office Calc could be used as an alternative. I generally prefer Open Office Calc when it comes to saving and opening files to and from the CSV format as it gives you better control over the encoding options. A spreadsheet tool is also useful when it comes to listing and working on existing vocabularies.
  • Wikidata Query Service: The Wikidata Query Service is absolutely central when it comes to displaying or exporting data that already exists in Wikidata. You will need to get acquainted with the SPARQL query language. I did this directly using the W3C specification, which turned out to be feasible; after an evening of trial and error I managed to write increasingly complex queries. There is also a User Manual specifically for the Wikidata Query Service, as well as an introduction for beginners. I cannot say much about their usefulness, as I haven't used either of them. There are further instructions on how to write queries on the Wikidata wiki itself; I found them useful when I had to work with qualifiers on a different project. And on a final note: if you are confronted with browser performance issues, try using the query service from within the Chrome browser (but note that there is a time-out for longer queries, with the acceptable length varying over time).
  • Open Refine (formerly Google Refine): At some point you will need to integrate data from two different data files by matching them on a shared key. This is for example the case when you want to map data already existing in Wikidata (let’s say the list of Swiss municipalities, comprising their official ID issued by the Federal Statistical Office and their Q-number on Wikidata) into your own dataset that serves as a basis for data ingestion. For this data matching I resorted to Open Refine (instructions on how to do this); you could also use a statistical package of your choice, such as R or SPSS, to do this quite easily, and it even seems to be possible using Excel, although I have never done it. Open Refine can also be used to further manipulate your data (e.g. splitting or merging columns), although you can also go quite far in Excel, using its TEXT functions for example.
  • Reconcile-csv Tool: Another task you may want to use Open Refine for is matching two datasets which do not yet share a common key. Reconcile-csv is a reconciliation service for OpenRefine running from a CSV file. It uses fuzzy matching to match entries in one dataset to entries in another dataset, helping to introduce unique IDs into the system – so they can be used to join your data painlessly afterwards.
  • Listeria Tool: This tool is useful if you want to generate lists based on a series of Wikidata items within a Wikipedia environment. Mind though that it is generally not accepted in the article namespace; but it can be used in the project namespace or in the user namespace. I didn’t find it very helpful for the data ingestion process as it is much less powerful than the Wikidata Query Service when it comes to writing more complex queries. It has however one big advantage: thanks to embedding the resulting lists into a wiki page, it provides an easy and straightforward way to keep track of all the changes made to the data comprised in a given list, as the bot regularly scans Wikidata for relevant modifications and overwrites the Wikipage with the newer version (use the page history to compare subsequent versions). The Listeria Tool also exists in a standalone version where it is possible to define, customize, and save dynamic lists based on Wikidata.
  • Quick Statements Tool: At the time of writing, the Quick Statements Tool is indispensable if you want to ingest larger quantities of data into Wikidata without resorting to a bot of your own. As it turned out, the tool still has its limitations when it comes to ingesting data along with its references, as the whole range of data models used on Wikidata isn’t supported by the tool yet. With some tweaking I found, however, a reasonable way to represent my references. For the rest, the tool is very easy to use and proved to be very reliable (see the input sketch after this list). When I used it, it wrote a triple (one line in the tool) in about 1-2 seconds. You can feed it lists of several thousands of triples at a time; if your Internet connection is interrupted, it will resume automatically upon re-connecting. Make sure that you check for error messages in the tool's output (search for the string “ERROR”). Before feeding longer lists of data, start step by step, one item after the other, inspecting the results to make sure that you don’t have any mistakes in your code. Especially make sure that you don’t create duplicate items! Otherwise you will have to do a lot of manual merging of items in order to clean up the mess. Also note that:
    • Existing labels and descriptions of Wikidata items are overwritten by those provided through the Quick Statements Tool, even if they are empty in your dataset! Therefore, avoid feeding empty label or description fields from your source file, unless you want to remove those already existing in Wikidata!
    • Other empty statements didn’t overwrite existing values: if your dataset contains empty fields for some values, make sure that you enter them as an empty string (""); these will be skipped by the tool. A completely empty field in a statement will bring the tool to a halt, as it considers the statement incomplete.
    • When the ingestion process is done (message: "All done"), it is worthwhile doing a quick search for the string "ERROR" in the ingestion log. It may help you spot potential double entries (items with the same label) or badly formed data (e.g. URLs with a missing "http://"). In the case of double entries, the duplicate item is actually created; you will then have to check whether the items in question should be merged. Badly formed URLs are simply skipped by the tool.
    • "http://www.example.org" and "http://www.example.org/" are recognized as two different strings. In the case of already existing items, you may generate double entries if the urls are not normalized beforehand.
The preceding observations were made during the data ingestion process in fall 2016. As the Quick Statements Tool may be further developed in the future, you should not blindly rely on them, but read the documentation and run your own tests before ingesting larger chunks of data.
  • Microsoft Word / Excel: In order to generate the code to be fed into the Quick Statements Tool, I used the mail merge functionality of Microsoft Word and Excel. It allows you to create templates for a given class of Wikidata items, with all the properties and references you want to add. Once you have created your template, you can link it to your Excel file with the data and generate the code needed to create new items with their properties or to add additional properties to existing Wikidata items. Once the mail merge process is completed, you can simply copy-paste the resulting document into the Quick Statements Tool.
  • Hatnote: “Listen to Wikipedia”: If you are getting bored while the Quick Statements Tool is ingesting your data, you can use this tool to listen to the data ingestion process. ;-)
  • Wikipedia-Bots to Monitor Newly Created Articles in Given Categories: Once you have created your items in Wikidata, you would expect that any newly created Wikipedia article relating to one of your items would automatically be linked to the pertinent Wikidata item. However, this may not be the case. If you want to monitor newly created articles in order to systematically link them to the pertinent Wikidata item, you may do so using the various Wikipedia bots that monitor newly created articles in given categories. I had been using this service on German Wikipedia even before ingesting data into Wikidata, in order to keep track of new Wikipedia articles on heritage institutions and to complement my inventory. There are similar bots on French Wikipedia and English Wikipedia, and probably also in other language versions.
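
To give an idea of the input format, here is a minimal sketch of Quick Statements (version 1) input with tab-separated columns; the label, description, and URLs are made-up example values, and the property numbers are taken from the mapping table below (P31 "instance of", P856 "official website", with S854 "reference URL" as the reference):

    CREATE
    LAST	Lde	"Beispielmuseum"
    LAST	Den	"museum in Switzerland"
    LAST	P31	Q33506	S854	"http://www.example.org/glam-inventory"
    LAST	P856	"http://www.example.org/"	S854	"http://www.example.org/glam-inventory"

CREATE generates a new item; LAST refers to the item just created; Lde/Den set a German label and an English description; each further line adds one statement, optionally followed by its reference.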

Challenges

I faced a series of challenges during the ingestion process, some expected and others rather unexpected, some very time-consuming and others rather anecdotal or peripheral to what I was aiming to do. In the end, none of them turned out to be insurmountable, but let’s nevertheless start with the tougher ones:

  • Ontology / Thesaurus Development: I ended up investing much more time than expected into the refinement of the ontology for heritage institutions, in particular into the creation of a typology of heritage institutions. The way Wikidata items have been generated based on information drawn from Wikipedia poses a series of challenges:
    • Several classes are conflated into one, as Wikipedia articles typically treat them without considering data modelling issues. Example: the class “museum” was defined as a subclass of “cultural institution” and as a subclass of “architectural structure”; while it makes sense to write one and the same Wikipedia article about the two aspects, it does not make much sense to combine them in one item from a data modelling perspective. My removal of the subclass “architectural structure” on the given Wikidata item was swiftly reverted by another user without further discussion.
    • Definitions of classes do not match across different language versions. This reflects the rather fuzzy linking of Wikipedia pages of various languages. For many Wikidata items, descriptions are simply missing or they are incomplete or erroneous given the subclasses an item is assigned to.
    • Some Wikidata items seem to result from combined categories in Wikipedia or Wikimedia Commons (example: “national museum in Thailand”), which are in turn a consequence of the lack of faceted search based on Wikipedia / Commons categories. In Wikidata, by contrast, it is easy to write a query that outputs all the items at the intersection of “national museum” and “located in Thailand”, which makes it rather meaningless to define museum subclasses of the kind “national museum in Thailand”.
There are yet other challenges related to thesaurus development:
  • Existing thesauri (GND, Library of Congress) often lack definitions of concepts, and where they map their entries to each other, the matching is sometimes rather fuzzy.
  • Moving existing thesauri to an international and multilingual space poses the problem of varying meanings across countries (example: the term “college” refers to rather different types of institutions depending on whether you are talking about the US, the UK, or Australia) – not to mention the challenges that arise in a multilingual space. In my case, I was simply very lucky that pretty much all the subclasses of heritage institutions are described in English or German, two languages I master rather well; there were a couple more in Swedish and one in Russian, for which I had no problem reading the corresponding Wikipedia article in order to grasp the meaning of the definition. Had the existing subclasses been described in languages which I don’t understand, this would have posed an almost insurmountable challenge. True internationalization of key thesauri seems to be an issue the library world hasn’t resolved yet. Further challenges related to ontology development are described in the section "Analyse special cases and identify data modelling issues" below. They will need to be looked into at some point when it comes to providing guidelines to a broader community of Wikidataists (e.g. the definition of "museum").
  • Creation of new properties: While new entities or classes can be created freely in Wikidata without asking anyone for permission, this is not the case for the creation of properties. If you need a new property, you have to propose it in the appropriate place and wait for other community members to comment on it and eventually create it. In my case, it took about a month to get the new property created that corresponds to the newly created GLAM-ID, which I use as a unique identifier in my source data file.
  • Missing properties: Wikidata presently does not have any properties to properly render postal addresses. We still need to think about how to model postal addresses as opposed to physical addresses, as these two are not necessarily the same. For now, I have just skipped the postal addresses.
  • Cleansing the existing data: Coming to grips with the existing data modelling issues and cleansing the data of your domain that already exists within Wikidata may represent a major task. For my part, I reviewed a random sample of the existing items (12 out of 230) in order to identify common data modelling issues and to decide how to deal with them. After data ingestion, it took me a few evenings to clean up most of the previously existing data.

Observations / Suggestions

A series of reflections occurred to me during the ingestion process; they may be taken into account in further activities:

  • Coordination within the Wikidata community: The various Wikidata project pages dealing with cultural heritage issues should be better organized. For example, I found the project page about “museums” only after ingesting my data about heritage institutions. Several project pages have scope issues, e.g. “cultural heritage” exclusively referring to “historical monuments” and the like. We also need to come up with a standard way of linking items about classes to relevant discussion pages of pertinent Wikidata projects. Right now, discussions – if they do happen – are scattered all over the place, and they probably do not happen often enough, for lack of an appropriate space for them to take place. Especially in the area of collaborative ontology and thesaurus development, we will probably need more intense collaboration in the future and will have to find ways to make common decisions stick in the everyday practice of those who generate new data entries.
  • Guidelines: We need better guidelines on how to ingest data into Wikidata. I would suggest gathering a series of case reports as the present one and distilling the key insights into a synthesis document.
  • Tools: The Quick Statements Tool should be improved in order to support all the conventions for reference declarations of Wikidata statements. Furthermore, it would be nice if the documentation were complemented with systematic information about the error testing that is done by the tool during data ingestion and about the handling of empty values. The Listeria Tool seems to have a built-in discovery function which is not properly documented and which – depending on the use case – could be considered a bug.
  • Data normalization: The way weblinks are stored within Wikidata should be normalized: ‘http://www.example.com’ should be stored as the same value as ‘http://www.example.com/’. Otherwise, many duplicate entries may inadvertently be created during automatic data ingestion (see the query sketch after this list).
  • Deprecated weblinks: There should be a policy regarding deprecated weblinks in Wikidata. Ideally a bot would be highlighting deprecated links for contributors to work through and decide whether they need updating or whether they should be kept for archival purposes (e.g. by adding a corresponding qualifier).
  • Inverse properties: We should think about having inverse properties (e.g. "has part" / "is part of" or "location" / "occupant") added automatically to the corresponding items. This requires some thought, however, on how to handle cases where these properties are being removed from items. Also, some of these (such as "location" / "occupant") may only hold for particular configurations (e.g. organization / building), but not for others.
  • Data quality: Enhancing and maintaining data quality within Wikidata will be one of the major challenges. Some reflections on this topic have already taken place elsewhere. My impression is that we need to address data quality both at the level of our ontologies and vocabularies and at the level of the individual data entries. Furthermore, we need to look not only for internal consistency of the data, but also for its fitness for purpose. To properly evaluate the latter, the data needs to be put to use – both within Wikipedia/Wikimedia projects and beyond – and the use cases need to be systematically studied. In this context we also need to address the question of data completeness. With regard to the quality of ontologies, the following research papers and dissertation provide interesting food for thought:
  • Burton-Jones, A., Storey, V. C., Sugumaran, V., & Ahluwalia, P. (2005). A semiotic metrics suite for assessing the quality of ontologies. Data & Knowledge Engineering, 55(1), 84-102.
  • Brank, J., Grobelnik, M., & Mladenic, D. (2005, October). A survey of ontology evaluation techniques. In Proceedings of the conference on data mining and data warehouses (SiKDD 2005) (pp. 166-170).
  • Obrst, L., Ceusters, W., Mani, I., Ray, S., & Smith, B. (2007). The evaluation of ontologies. In Semantic web (pp. 139-158). Springer US.
  • Strohmaier, M., Walk, S., Pöschko, J., Lamprecht, D., Tudorache, T., Nyulas, C., ... & Noy, N. F. (2013). How ontologies are made: Studying the hidden social dynamics behind collaborative ontology engineering projects. Web Semantics: Science, Services and Agents on the World Wide Web, 20, 18-34.
  • Vrandečić, D. (2009). Ontology evaluation. In Handbook on Ontologies (pp. 293-313). Springer Berlin Heidelberg.
  • Complexity of data models: Related to the pragmatic aspect of the quality of ontologies is the complexity of data models: where do we strike a balance between creating ever more complex data models in order to represent reality in more detail on the one hand, and being able to write relatively simple queries against these models on the other? To give an example: it is tempting to extend the inventory of heritage institutions by ‘historicising’ it: institutions can have changing names over time, the municipality they are located in may merge with another one and change its name, they may change their address, or they may close altogether. To keep the data model simple, we could just limit ourselves to representing the current state; if we want to be able to describe a reality that is evolving over time, we can add “start date” and “end date” qualifiers to all of these properties. This would however make the data model more complex to query, and it would make it harder to achieve data completeness (at least if you want to be able to run queries that render the situation at a particular moment in the past).
  • Better break-down of tasks: The data ingestion process as I carried it out comprises many steps of various complexity and requiring different kinds of know-how and tools. Furthermore, in order to achieve data completeness, further steps would be required, e.g. the adding of labels and descriptions in several languages. In order to enhance the ability to crowdsource some of these tasks, it would be useful to think about how to break down the various tasks into smaller chunks and to centrally track progress on these tasks.
  • Wikidata for primary data management? (vs. secondary database): As soon as the community starts adding additional properties or creates a data model that is more complex than the data model of the original source, Wikidata changes its character from a secondary database to a platform for primary data management. The same may happen in the area of ontology development: as soon as the definitions of concepts become more refined on Wikidata than the ones used in the original data source, we may gradually slip into a situation where Wikidata becomes the primary locus for the integration and management of the data in a given domain. If this happens, we will need to consider to what extent and in what form changes to data models and to individual entries propagate back to our original data sources.
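
To illustrate the normalization issue mentioned above, here is a SPARQL sketch (untested at scale, and restricted to Swiss items to keep it cheap) that looks for pairs of items whose official website (P856) values differ only by a trailing slash:

    SELECT ?item1 ?item2 ?url1 ?url2 WHERE {
      ?item1 wdt:P856 ?url1 ; wdt:P131/wdt:P17 wd:Q39 .
      ?item2 wdt:P856 ?url2 ; wdt:P131/wdt:P17 wd:Q39 .
      FILTER (?item1 != ?item2)
      # Keep pairs where one URL equals the other plus a trailing slash.
      FILTER (STR(?url1) = CONCAT(STR(?url2), "/"))
    }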

Next Steps

As noted above, the data ingestion described in this report represents just an intermediate step within a larger endeavour. From here several further steps are possible:

  • Aim for quality and completeness: All the items for the heritage institutions on a given territory should fulfil the requirements of showcase items and they should represent a complete set of items, e.g. covering all the heritage institutions existing on this territory at a given moment in time. On the aspect of completeness, see also [1].
  • Add further data fields: Add data regarding the postal address (still needs to be modelled on Wikidata), historicised data (opening date, closing date, etc.), as well as “is part of” and “has part” relationships.
  • Ontology/thesaurus development: Eliminate the flaws in the current ontology/thesaurus and the way they are used within Wikidata. This will require some community discussion.
  • Break-down of tasks / crowdsourcing: Break down the tasks related to the further improvement of the database into smaller chunks that can meaningfully be carried out by a larger number of people.
  • Develop use cases for the data: Put the data to use, both within Wikipedia and beyond, for example at hackathons, for apps, etc. Chances are good that the use of the data would trigger positive feedback loops, not only in terms of quality and completeness of the data (thanks to increased attention and a better sense of what “fitness for purpose” actually means in a given context), but also with regard to the ingestion of further cultural data. Interesting avenues would be to systematically use the data in Wikipedia infoboxes, to generate lists of heritage institutions on Wikipedia project pages that can be worked through, creating an article for each entry, or to use the data on Wikidata as a basis for lists in the article namespace (for further information about the generation of Wikidata-based lists, see: Wikidata:List_generation_input).
  • Add data about heritage institutions in other countries: The Swiss example could be followed, and the data for the other countries for which we have virtually complete datasets could be ingested.
  • Integrate the existing databases covering heritage institutions in Switzerland: Start adding references to all relevant databases (e.g. the museums database of the Swiss Museums Association, the ISIL inventory (maintained by the Swiss National Library and to be published in form of linked open data in the course of 2017), or the inventories of Swiss archives maintained by infoclio.ch and arCHeco). The regular update of the databases (potentially working in both directions) could be institutionalized; further data fields could be entered into Wikidata.

Long Term Vision: Describing the Entire Heritage Domain in Wikidata

In the longer term, we could also consider how to describe the entire cultural heritage domain in Wikidata. There are many existing initiatives that fall into this area and that might benefit from some systematization and coordination, such as:

  • Wikidata:WikiProject Cultural heritage, with a focus on ingesting monuments databases, public art databases, and inventories of heritage institutions. See also: Wikidata:WikiProject_WLM, which focuses on doing what is necessary to run the various tools related to the Wiki Loves Monuments photo contest directly from Wikidata.
  • Wikidata:WikiProject_Modernisme, with the aim of improving items of Art Nouveau artworks located in Catalonia, which are currently spread in online databases throughout the Internet (and even offline sources).
  • Wikidata:WikiProject_Music, focusing on work on properties that can be used by music infoboxes, on the ingestion of all the music-related data that is presently spread across Wikipedia, Wikimedia Commons, and Wikisource, and on establishing methods to interact with this data from different projects.
  • Wikidata:WikiProject_Source_MetaData, focusing on citation data and bibliographic data, with the aim of defining a set of properties that can be used by citations, infoboxes, and Wikisource; of mapping and importing all relevant metadata that is currently spread across Commons, Wikipedia, and Wikisource; of establishing methods to interact with this metadata from different projects; of creating a large open bibliographic database within Wikidata; and of revealing, building, and maintaining community stakeholdership for the inclusion and management of source metadata in Wikidata.
  • Wikidata:WikiProject_Books, with a focus on ingesting data related to books, cited in Wikipedia, referenced on Wikimedia Commons or contained in Wikisource.
  • Wikidata:WikiProject_Periodicals, with the aim of defining a set of properties for infobox templates for journals and magazines; of defining a set of properties about periodical publishers; of mapping and importing data about journals cited by Wikipedia; of mapping and ingesting periodicals-related data on Commons, Wikipedia, and Wikisource; and of establishing methods to interact with this data from different projects.
  • Wikidata:WikiProject_Authority_control, which coordinates work around authority control, with the objective of creating the “sum of all people”, with links to their works. Thus, Wikidata is becoming the place where the authority files of various heritage institutions are being co-referenced. See also: Wikidata:WikiProject_Genealogy, focusing on improving information about family trees on Wikidata, and Wikidata:WikiProject_Names, aiming to improve the structure of name-related data in Wikidata.
  • Wikidata:WikiProject_Infoboxes, with the aim of identifying good examples of Wikipedia infoboxes and Wikidata items as use cases for phase II of Wikidata (facilitating auto-translation to Wikipedia infobox templates), of mapping and harmonizing Wikidata properties to common infobox parameters, of suggesting new Wikidata properties and of coordinating bot activities for collecting data from specific infoboxes and external sources. It covers infoboxes for persons (outdated), organizations, events, works, terms, and places (including buildings).

There is a series of typologies that can be drawn upon in order to define the scope of a project whose aim it is to describe the entire heritage domain – or at least large chunks of it – within Wikidata (and accessorily within Wikimedia Commons and Wikisource). Given such a framework, it becomes possible to set goals for the different areas and to systematically track progress across the various projects contributing to the overall objective.

Based on a tentative typology of the scope of heritage data platforms (which is partly inspired by the categorization of cultural heritage objects used in the ENUMERATE and the OpenGLAM Benchmark Surveys), the manifestations of cultural heritage can be categorized as follows:

Level 1 categories are given as top-level bullets, level 2 categories as sub-bullets, and level 3 types after the colon:

  • heritage objects
    • text based resources: books; manuscripts; autographs; periodicals; newspapers
    • two-dimensional visual resources: drawings; paintings; engravings; prints; photographs; posters; sheet music; maps
    • archival resources: official documents; archival records
    • three-dimensional man-made movable objects: three-dimensional works of art; furnishings and equipment; craft artefacts; coins and medals; toys; objects of daily use
    • natural resources: natural inert specimens; natural living specimens
    • geography based resources: monuments and buildings (including larger ships); landscapes; archeological sites
    • time based resources: audio documents; film documents; video recordings
    • digital interactive resources: databases; digital three-dimensional designs or reconstructions of objects and buildings; born-digital art objects; digital research files; GIS files; games; software; websites or parts thereof
  • cultural performances
    • living traditions: inventories of living traditions
    • performing arts: inventories of performing arts productions

With regard to the heritage objects, coverage can be expressed in terms of depth of information, according to the following hierarchy (inspired by a recent, at the time of writing still unpublished draft of a Whitepaper on Archival Data Portals by the Association of Swiss Archivists):

  1. inventories of institutions holding heritage objects (heritage institutions)
  2. summary descriptions of heritage institutions’ holdings
  3. inventories of heritage objects (metadata)
  4. heritage content (digitized objects / digital-born material)
  5. searchable content (OCR, transcriptions)
  6. datafication at the content level (entity extraction, semantic tagging of content)

With regard to the performing arts, coverage can be expressed in terms of depth of information, according to the following hierarchy:

  1. inventories of organizations, places (e.g. stages), and events (e.g. festivals) hosting cultural performances
  2. inventories of performing arts productions (with pointers to the works performed, place, time, type of performance, and all the people and collectives contributing to the production)

When it comes to digital artifacts of heritage content, the following media types can be distinguished:

  • image
  • audio
  • audio/video
  • text
  • text/images
  • interactive online media (multimedia)
  • databases
  • 3D models

In addition, the various data ingestion projects may be limited along the following dimensions (inspired by the limitations observed in cultural heritage data portals as well as in the existing Wikiprojects listed above):

  • geographical scope
  • epoch / time scope
  • topic areas
  • relevance / notability

Ingesting comprehensive inventories of heritage institutions seems to be a promising entry point when it comes to systematically describing the heritage domain, as they can provide a link between many of the Wikiprojects listed above.

Field Notes: Ingesting Data from the Swiss GLAM Inventory

In this section you will find the notes I took while carrying out the ingestion project, roughly following the procedure laid out above.

Get acquainted with the data structure in the source file and in the target environment (Wikidata)

Source file: Swiss GLAM Inventory (datahub.io)

Do a first mapping of data fields / propose new properties where needed

Class | Property (in datafile) | Property (Wikidata) | Refers to class (Wikidata) / possible values | Remarks
Institution / organization (Q43229) | GLAM-ID | GLAM ID (P3066) | GLAM ID (Q25839974) | The "GLAM ID" property was newly proposed on 12 July 2016 and created on 9 August 2016.
 | Wikidata | Wikidata Q-number | (Q number) |
 | isPartOf_WD | part of (P361) | organization (Q43229) |
 | Institution-type_1 (= 'museum') | instance of (P31) | museum (Q33506) | The definitions on WD need some disentangling: archive is presently listed as a subclass of library; the definitions of archive and library are less specific in English than in other languages; there is generally an unfortunate mix between 'organization' and 'building'.
 | Institution-type_1 (= 'library') | instance of (P31) | library (Q7075) |
 | Institution-type_1 (= 'archive') | instance of (P31) | archive (Q166118) |
 | Institution-type_1 (= 'combination') | instance of (P31) | museum (Q33506); library (Q7075); archive (Q166118) | Here, multiple attributes need to be added on a case-by-case basis.
 | Institution-type_1 (= 'other') | | |
 | Institution-type_2 | instance of (P31) | | A controlled vocabulary needs to be developed (cf. ISIL types).
 | Institution-type_2 (= 'regional or local museum') | instance of (P31) | local museum (Q1595639) |
 | Institution-type_2 (= 'museum of archaeology') | instance of (P31) | archaeological museum (Q3329412) |
 | Institution-type_2 (= 'museum of art') | instance of (P31) | art museum (Q207694) |
 | Institution-type_2 (= 'museum of ethnography or anthropology') | instance of (P31) | ethnographic museum (Q26944892) |
 | Institution-type_2 (= 'museum of history') | instance of (P31) | history museum (Q16735822) |
 | Institution-type_2 (= 'museum of natural history or natural science') | instance of (P31) | science museum (Q588140) |
 | Institution-type_2 (= 'museum of science or technology') | instance of (P31) | technology museum (Q2398990) |
 | Institution-type_2 (= 'museum with a theme') | | | Items with Institution-type_2 "museum with a theme" don't get an extra P31 statement.
 | Institution-type_2 (= 'national library') | instance of (P31) | national library (Q22806) |
 | Institution-type_2 (= 'state library') | instance of (P31) | cantonal library (Q678405) |
 | Institution-type_2 (= 'municipal library') | instance of (P31) | municipal library (Q2326815) |
 | Institution-type_2 (= 'higher education library') | instance of (P31) | academic library (Q856234) |
 | Institution-type_2 (= 'monastery or church library') | instance of (P31) | no label (Q1776381); church library (Q27030553) | The items need to be assigned to either of the two classes.
 | Institution-type_2 (= 'special library (thematic library)') | instance of (P31) | special library (Q385994) |
 | Institution-type_2 (= 'other type of library') | | | Items with Institution-type_2 "other type of library" don't get an extra P31 statement.
 | Institution-type_2 (= 'national archive') | instance of (P31) | national archives (Q15303967) |
 | Institution-type_2 (= 'state archive') | instance of (P31) | cantonal archive (Q2860410) |
 | Institution-type_2 (= 'municipal archive') | instance of (P31) | municipal archive (Q604177) |
 | Institution-type_2 (= 'monastery or church archive') | instance of (P31) | monastery archive (Q27030561); church archive (Q27030746) | The items need to be assigned to either of the two classes.
 | Institution-type_2 (= 'audio-visual / broadcasting archive') | instance of (P31) | audio-visual archive (Q27030766) |
 | Institution-type_2 (= 'company archive') | instance of (P31) | company archives (Q27030778) |
 | Institution-type_2 (= 'archive of a government agency') | instance of (P31) | public archive (Q27031009) |
 | Institution-type_2 (= 'archive of an international organization') | instance of (P31) | archive of an international organization (Q27031014) |
 | Institution-type_2 (= 'archive of a non-commercial organization') | instance of (P31) | association archive (Q27030820); foundation archive (Q27030827); hospital archive (Q27030837); university archive (Q27030870); school archive (Q27030883); public archive (Q27031009); political archive (Q27030921) | The items need to be assigned to one of the sub-classes.
 | Institution-type_2 (= 'special archive (thematic archive)') | instance of (P31) | specialized archive (Q27030941); performing arts archive (Q27030945); economic archive (Q27032167); art archive (Q27032254); scientific archive (Q27032095); etc. | The items need to be assigned to one of the sub-classes.
 | Institution-type_2 (= 'other archive / records office') | | | Items with Institution-type_2 "other archive / records office" don't get an extra P31 statement.
 | Institution-type_2 (= 'film institute') | instance of (P31) | cinematheque (Q1352795) |
 | Institution-type_2 (= 'institution for monument care') | instance of (P31) | office for the preservation of historical monuments (Q1188075) |
 | Institution-type_2 (= 'archaeological service') | instance of (P31) | archaeological service (Q27031016) |
 | Institution-type_2 (= 'other type of institution') | | |
 | hasNoHeritageCollection (!= 1) | instance of (P31) | memory institution (Q1497649) |
 | inExistenceSince | inception (P571) | (date) |
 | inExistenceSince | significant event (P793) | opening (Q15051339); reopening (Q16571590) | (date as qualifier)
 | sourceOpening | reference URL (P854) | (URL) | As qualifier for statements regarding 'inExistenceSince'.
 | inExistenceUntil | dissolved, abolished or demolished date (P576) | (date) |
 | inExistenceUntil | significant event (P793) | closure (Q5135520) | (date as qualifier)
 | inExistenceUntil | significant event (P793) | renovation (Q2144402) | (start time and end time as qualifiers)
 | sourceClosure | reference URL (P854) | (URL) | As qualifier for statements regarding 'inExistenceUntil'.
 | Designation | Lde, Lfr, Lit, Len | (text) | Labels in various languages.
 | Designation | Ade, Afr, Ait, Aen | (text) | Aliases in various languages.
 | isPartOf | part of (P361) | organization (Q43229) |
 | Postal_address_line1, Postal_address_line2, Postal_address_line3, Postal_address_street_and_no, Postal_address_complement, Postal_address_zip_code, Postal_address_town | | | There seems to be no possibility to represent this in Wikidata at the moment. Maybe there should be a separate property for 'postal address' that can be used in cases where the postal address differs from the address of the physical location and/or where the first lines of a postal address do not correspond to an institution's name (e.g. a local museum where the postal address points to the private address of the president of the local history association). In case the postal address comprises a PO Box, post office box (P2918) can be used, with postal code (P281) as qualifier.
 | Physical_location_address | located at street address (P969) | (text) | Possible alternative: located on street (P669), with the street number as qualifier.
 | Physical_location_zip_code | postal code (P281) | (number) |
 | Physical_location_municipality | | | This field may point to an official municipality (in which case it is equivalent to 'Municipality') or to a part of a municipality or a former municipality.
 | Language | | | This may be rendered as official language (P37). The field is however mostly useful to disentangle the labels in the various languages (see 'Designation').
 | Website | official website (P856) | (URL) |
 | PCP-No | PCP reference number (P381) | (number) | PCP reference number of the collection (and not of the building!).
 | PCP-Category (= 'A') | heritage designation (P1435) | class A Swiss cultural property of national significance (Q8274529) | PCP category of the collection (and not of the building!).
 | PCP-Category (= 'B') | heritage designation (P1435) | class B Swiss cultural property of regional significance (Q12126757) |
 | hasPart_(PCP-No) | has part(s) (P527) | collection (Q2668072); organization (Q43229) |
 | isPartOf_(PCP-No) | part of (P361) | organization (Q43229) |
 | Municipality | located in the administrative territorial entity (P131) | municipality of Switzerland (Q70208) |
 | Canton | located in the administrative territorial entity (P131) | canton of Switzerland (Q23058) |
 | Country | country (P17) | country (Q6256) |
 | (GND ID) | GND ID (P227) | (text) | Not presently part of the data file; may be added later.
 | (ISIL ID) | ISIL (P791) | (text) | Not presently part of the data file; may be added later.
 | (Coordinates) | coordinate location (P625) | (coordinates) | Not presently part of the data file; may be added later based on 'Physical_location_address'.

Note: Compared to the RDF rendering of the same data, the data structure on Wikidata is 'flatter':

  • The collection is not modeled as a class separate from the institution.
  • schema.org uses a nested structure to model place / postal address.

Find out what data is already present on Wikidata

Use the Listeria Tool to visualize existing data from Wikidata in a table on the Wiki

  • List of Swiss Heritage Institutions. Note: so far, I haven't found a good way to disentangle municipalities and cantons in the Listeria tool, as they both relate to the same property.
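
For reference, a Listeria list is defined by wrapping a SPARQL fragment in the {{Wikidata list}} template on the wiki page; the following is a rough sketch only (the exact parameters should be checked against the Listeria documentation):

    {{Wikidata list
    |sparql=SELECT ?item WHERE { ?item wdt:P31/wdt:P279* wd:Q33506 . ?item wdt:P131/wdt:P17 wd:Q39 }
    |columns=label,description,p131,p856
    }}
    {{Wikidata list end}}

The bot then replaces everything between the two templates with the generated table on each update.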

Write SPARQL Queries to extract existing data from Wikidata

Note: Use the Chrome browser for longer queries; it seems to run more smoothly than Firefox. There seems to be a limit to the number of arguments you can put into a SPARQL query to Wikidata (see: Discussion).

Heritage institutions in Switzerland
  • tinyurl.com/zpw99ly (a simple query listing the heritage institutions in Switzerland)
  • tinyurl.com/ha2pukb (outputting the labels of the various attributes in addition)
  • tinyurl.com/gr63dpp (outputting descriptions in addition)
  • tinyurl.com/jag5h79 (refining the 'located in Switzerland' part)
  • tinyurl.com/joevvaa (concatenating multiple values for the same variable; see: How to concatenate a list of values in SPARQL?)
SPARQL Query
SELECT ?item
       ?Label_en
       ?Description_en
       (group_concat(distinct ?Alias_en;separator="; ") as ?Aliases_en)    #Concatenate the values in order not to get several rows per item.
       ?Label_de
       ?Description_de
       (group_concat(distinct ?Alias_de;separator="; ") as ?Aliases_de)
       (replace(group_concat(distinct ?Type;separator="; "), "http://www.wikidata.org/entity/", "") as ?Types)  #Strip the path in order to get only the Q-number.
       (group_concat(distinct ?TypeLabel_en;separator="; ") as ?TypeLabels_en)
       (replace(group_concat(distinct ?Municipality;separator="; "), "http://www.wikidata.org/entity/", "") as ?Municipalities)   
       (group_concat(distinct ?MunicipalityLabel_de;separator="; ") as ?MunicipalityLabels_de)
       (replace(group_concat(distinct ?Canton;separator="; "), "http://www.wikidata.org/entity/", "") as ?Cantons)
       (group_concat(distinct ?CantonLabel_de;separator="; ") as ?CantonLabels_de)
       (group_concat(distinct ?PCPNo;separator="; ") as ?PCPNos)
  WHERE {
  {?item wdt:P31 ?museum . ?museum wdt:P279* wd:Q33506 } 
    UNION {?item wdt:P31 ?archive . ?archive wdt:P279* wd:Q166118 } 
    UNION {?item wdt:P31 ?library . ?library wdt:P279* wd:Q7075} . 
  {?item wdt:P131/wdt:P17 wd:Q39} .  
  OPTIONAL { ?item rdfs:label ?Label_en . FILTER (lang(?Label_en) = "en") }
  OPTIONAL { ?item rdfs:label ?Label_de . FILTER (lang(?Label_de) = "de") } 
  OPTIONAL { ?item rdfs:label ?Label_fr . FILTER (lang(?Label_fr) = "fr") }
  OPTIONAL { ?item rdfs:label ?Label_it . FILTER (lang(?Label_it) = "it") }
  OPTIONAL { ?item skos:altLabel ?Alias_en . FILTER (lang(?Alias_en) = "en") }
  OPTIONAL { ?item skos:altLabel ?Alias_de . FILTER (lang(?Alias_de) = "de") } 
  OPTIONAL { ?item skos:altLabel ?Alias_fr . FILTER (lang(?Alias_fr) = "fr") }
  OPTIONAL { ?item skos:altLabel ?Alias_it . FILTER (lang(?Alias_it) = "it") }
  OPTIONAL { ?item schema:description ?Description_en . FILTER (lang(?Description_en) = "en") }
  OPTIONAL { ?item schema:description ?Description_de . FILTER (lang(?Description_de) = "de") } 
  OPTIONAL { ?item schema:description ?Description_fr . FILTER (lang(?Description_fr) = "fr") }
  OPTIONAL { ?item schema:description ?Description_it . FILTER (lang(?Description_it) = "it") }
  OPTIONAL { ?item wdt:P31 ?Type. }
  OPTIONAL { ?item wdt:P31/rdfs:label ?TypeLabel_en . FILTER (lang(?TypeLabel_en) = "en") }  
  OPTIONAL { ?item wdt:P131 ?Municipality . FILTER EXISTS {?Municipality wdt:P31 wd:Q70208}}  
  OPTIONAL { ?item wdt:P131/rdfs:label ?MunicipalityLabel_de . FILTER EXISTS {?MunicipalityLabel_de ^rdfs:label/wdt:P31 wd:Q70208} . FILTER (lang(?MunicipalityLabel_de) = "de")}
  OPTIONAL { ?item wdt:P131 ?Canton . FILTER EXISTS {?Canton wdt:P31 wd:Q23058}}
  OPTIONAL { ?item wdt:P131/rdfs:label ?CantonLabel_de . FILTER EXISTS {?CantonLabel_de ^rdfs:label/wdt:P31 wd:Q23058} . FILTER (lang(?CantonLabel_de) = "de") }  
  OPTIONAL { ?item wdt:P17 ?Country. }
  OPTIONAL { ?item wdt:P17/rdfs:label ?CountryLabel_de . FILTER (lang(?CountryLabel_de) = "de") }
  OPTIONAL { ?item wdt:P381 ?PCPNo. } 
  SERVICE wikibase:label { bd:serviceParam wikibase:language "de". }    
}
group by ?item                          #List all the variables for which the values are not concatenated!
         ?Label_en ?Description_en 
         ?Label_de ?Description_de 
         ?Label_fr ?Description_fr 
         ?Label_it ?Description_it
  • tinyurl.com/zhdke63 (all GLAMs and memory institutions in Switzerland and Liechtenstein)
Swiss Municipalities

The following query uses these properties:

  • Properties: instance of (P31), Swiss municipality code (P771)
    SELECT ?item ?FSO_code ?Label_en ?Label_de ?Label_fr ?Label_it WHERE {
    {?item wdt:P31 wd:Q70208}.
    {?item wdt:P771 ?FSO_code}.   
      OPTIONAL { ?item rdfs:label ?Label_en . FILTER (lang(?Label_en) = "en") }
      OPTIONAL { ?item rdfs:label ?Label_de . FILTER (lang(?Label_de) = "de") } 
      OPTIONAL { ?item rdfs:label ?Label_fr . FILTER (lang(?Label_fr) = "fr") }
      OPTIONAL { ?item rdfs:label ?Label_it . FILTER (lang(?Label_it) = "it") }
    }
    

http://tinyurl.com/jv8zkay

Swiss Municipalities with Cantons

http://tinyurl.com/he4wq59

Cantons of Switzerland (including former cantons)

The following query uses these properties:

  • Properties: instance of (P31)
    SELECT ?item ?Label_en ?Label_de ?Label_fr ?Label_it WHERE {
    {?item wdt:P31 wd:Q23058}.
      OPTIONAL { ?item rdfs:label ?Label_en . FILTER (lang(?Label_en) = "en") }
      OPTIONAL { ?item rdfs:label ?Label_de . FILTER (lang(?Label_de) = "de") } 
      OPTIONAL { ?item rdfs:label ?Label_fr . FILTER (lang(?Label_fr) = "fr") }
      OPTIONAL { ?item rdfs:label ?Label_it . FILTER (lang(?Label_it) = "it") }
    }
    

http://tinyurl.com/zth7zqo (contains former cantons)

Cantons of Switzerland (only present-day cantons, with ISO codes)

The following query uses these properties:

  • Properties: instance of (P31), ISO 3166-2 code (P300)
    SELECT ?item ?ISO_code ?Label_en ?Label_de ?Label_fr ?Label_it WHERE {
    {?item wdt:P31 wd:Q23058}.
    {?item wdt:P300 ?ISO_code}.   
      OPTIONAL { ?item rdfs:label ?Label_en . FILTER (lang(?Label_en) = "en") }
      OPTIONAL { ?item rdfs:label ?Label_de . FILTER (lang(?Label_de) = "de") } 
      OPTIONAL { ?item rdfs:label ?Label_fr . FILTER (lang(?Label_fr) = "fr") }
      OPTIONAL { ?item rdfs:label ?Label_it . FILTER (lang(?Label_it) = "it") }
    }
    

http://tinyurl.com/ztsunud (only present-day cantons, with ISO codes)

Sub-classes of heritage institutions

The following query uses these properties:

  • Properties: subclass of (P279)
    SELECT ?item
           (group_concat(distinct ?superClassLabel_en;separator="; ") as ?superClassLabels_en)    #Concatenate the values in order not to get several rows per item.
           (replace(group_concat(distinct ?superClass;separator="; "), "http://www.wikidata.org/entity/", "") as ?superClasses)   #Strip the path in order to get only the Q-number.        
    	   ?Label_en
           (group_concat(distinct ?Alias_en;separator="; ") as ?Aliases_en)    #Concatenate the values in order not to get several rows per item.
           ?Label_de
           (group_concat(distinct ?Alias_de;separator="; ") as ?Aliases_de)
           ?Label_fr
           (group_concat(distinct ?Alias_fr;separator="; ") as ?Aliases_fr)
           ?Label_it
           (group_concat(distinct ?Alias_it;separator="; ") as ?Aliases_it)
           ?Label_es
           (group_concat(distinct ?Alias_es;separator="; ") as ?Aliases_es)
           ?Label_ru
           (group_concat(distinct ?Alias_ru;separator="; ") as ?Aliases_ru)
           ?Description_en       
           ?Description_de
           ?Description_fr
           ?Description_it
           ?Description_es
           ?Description_ru
    WHERE {
      {?item wdt:P279+ wd:Q33506} UNION {?item wdt:P279+ wd:Q166118} UNION {?item wdt:P279+ wd:Q7075}.
      OPTIONAL { ?item rdfs:label ?Label_en . FILTER (lang(?Label_en) = "en") }
      OPTIONAL { ?item rdfs:label ?Label_de . FILTER (lang(?Label_de) = "de") } 
      OPTIONAL { ?item rdfs:label ?Label_fr . FILTER (lang(?Label_fr) = "fr") }
      OPTIONAL { ?item rdfs:label ?Label_it . FILTER (lang(?Label_it) = "it") }
      OPTIONAL { ?item rdfs:label ?Label_es . FILTER (lang(?Label_es) = "es") }
      OPTIONAL { ?item rdfs:label ?Label_ru . FILTER (lang(?Label_ru) = "ru") }
      OPTIONAL { ?item skos:altLabel ?Alias_en . FILTER (lang(?Alias_en) = "en") }
      OPTIONAL { ?item skos:altLabel ?Alias_de . FILTER (lang(?Alias_de) = "de") } 
      OPTIONAL { ?item skos:altLabel ?Alias_fr . FILTER (lang(?Alias_fr) = "fr") }
      OPTIONAL { ?item skos:altLabel ?Alias_it . FILTER (lang(?Alias_it) = "it") }
      OPTIONAL { ?item skos:altLabel ?Alias_es . FILTER (lang(?Alias_es) = "es") }
      OPTIONAL { ?item skos:altLabel ?Alias_ru . FILTER (lang(?Alias_ru) = "ru") }
      OPTIONAL { ?item schema:description ?Description_en . FILTER (lang(?Description_en) = "en") }
      OPTIONAL { ?item schema:description ?Description_de . FILTER (lang(?Description_de) = "de") } 
      OPTIONAL { ?item schema:description ?Description_fr . FILTER (lang(?Description_fr) = "fr") }
      OPTIONAL { ?item schema:description ?Description_it . FILTER (lang(?Description_it) = "it") }
      OPTIONAL { ?item schema:description ?Description_es . FILTER (lang(?Description_es) = "es") }
      OPTIONAL { ?item schema:description ?Description_ru . FILTER (lang(?Description_ru) = "ru") }
      OPTIONAL { ?item wdt:P279 ?superClass }
      OPTIONAL { ?item wdt:P279/rdfs:label ?superClassLabel_en . FILTER (lang(?superClassLabel_en) = "en") }
    }
    group by ?item  						#List all the variables for which the values are not concatenated!
             ?Label_en ?Description_en 
             ?Label_de ?Description_de 
             ?Label_fr ?Description_fr 
             ?Label_it ?Description_it
             ?Label_es ?Description_es
             ?Label_ru ?Description_ru
    

http://tinyurl.com/ht5fvuc

Sub-classes of heritage institutions, with the different levels of super-classes

The following query uses these properties:

  • Properties: subclass of (P279)
    SELECT ?item
           (group_concat(distinct ?institutionType1_en;separator="; ") as ?institutionTypes1_en)       
           (group_concat(distinct ?institutionType2_en;separator="; ") as ?institutionTypes2_en)
           (group_concat(distinct ?institutionType3_en;separator="; ") as ?institutionTypes3_en)
           (group_concat(distinct ?institutionType4_en;separator="; ") as ?institutionTypes4_en)
           ?Label_en
           (group_concat(distinct ?Alias_en;separator="; ") as ?Aliases_en)    #Concatenate the values in order not to get several rows per item.
           ?Label_de
           (group_concat(distinct ?Alias_de;separator="; ") as ?Aliases_de)
    WHERE {
      {?item wdt:P279+ wd:Q33506} UNION {?item wdt:P279+ wd:Q166118} UNION {?item wdt:P279+ wd:Q7075}.
      OPTIONAL { ?item rdfs:label ?Label_en . FILTER (lang(?Label_en) = "en") }
      OPTIONAL { ?item rdfs:label ?Label_de . FILTER (lang(?Label_de) = "de") } 
      OPTIONAL { ?item skos:altLabel ?Alias_en . FILTER (lang(?Alias_en) = "en") }
      OPTIONAL { ?item skos:altLabel ?Alias_de . FILTER (lang(?Alias_de) = "de") } 
      OPTIONAL { ?item wdt:P279+ ?institutionType1 . ?institutionType1 wdt:P279/^wdt:P279 wd:Q33506 . ?institutionType1 rdfs:label ?institutionType1_en . FILTER (lang(?institutionType1_en) = "en") }
      OPTIONAL { ?item wdt:P279+ ?institutionType1 . ?institutionType1 wdt:P279/^wdt:P279 wd:Q166118 . ?institutionType1 rdfs:label ?institutionType1_en . FILTER (lang(?institutionType1_en) = "en") }
      OPTIONAL { ?item wdt:P279+ ?institutionType1 . ?institutionType1 wdt:P279/^wdt:P279 wd:Q7075 . ?institutionType1 rdfs:label ?institutionType1_en . FILTER (lang(?institutionType1_en) = "en") }
      OPTIONAL { ?item wdt:P279+ ?institutionType2 . ?institutionType2 wdt:P279 wd:Q33506 . ?institutionType2 rdfs:label ?institutionType2_en . FILTER (lang(?institutionType2_en) = "en") }
      OPTIONAL { ?item wdt:P279+ ?institutionType2 . ?institutionType2 wdt:P279 wd:Q166118 . ?institutionType2 rdfs:label ?institutionType2_en . FILTER (lang(?institutionType2_en) = "en") }
      OPTIONAL { ?item wdt:P279+ ?institutionType2 . ?institutionType2 wdt:P279 wd:Q7075 . ?institutionType2 rdfs:label ?institutionType2_en . FILTER (lang(?institutionType2_en) = "en") }
      OPTIONAL { ?item wdt:P279+ ?institutionType3 . ?institutionType3 wdt:P279/wdt:P279 wd:Q33506 . ?institutionType3 rdfs:label ?institutionType3_en . FILTER (lang(?institutionType3_en) = "en") }
      OPTIONAL { ?item wdt:P279+ ?institutionType3 . ?institutionType3 wdt:P279/wdt:P279 wd:Q166118 . ?institutionType3 rdfs:label ?institutionType3_en . FILTER (lang(?institutionType3_en) = "en") }
      OPTIONAL { ?item wdt:P279+ ?institutionType3 . ?institutionType3 wdt:P279/wdt:P279 wd:Q7075 . ?institutionType3 rdfs:label ?institutionType3_en . FILTER (lang(?institutionType3_en) = "en") }
      OPTIONAL { ?item wdt:P279+ ?institutionType4 . ?institutionType4 wdt:P279/wdt:P279/wdt:P279 wd:Q33506 . ?institutionType4 rdfs:label ?institutionType4_en . FILTER (lang(?institutionType4_en) = "en") }
      OPTIONAL { ?item wdt:P279+ ?institutionType4 . ?institutionType4 wdt:P279/wdt:P279/wdt:P279 wd:Q166118 . ?institutionType4 rdfs:label ?institutionType4_en . FILTER (lang(?institutionType4_en) = "en") }
      OPTIONAL { ?item wdt:P279+ ?institutionType4 . ?institutionType4 wdt:P279/wdt:P279/wdt:P279 wd:Q7075 . ?institutionType4 rdfs:label ?institutionType4_en . FILTER (lang(?institutionType4_en) = "en") }
    }
    group by ?item  						#List all the variables for which the values are not concatenated!
             ?Label_en 
             ?Label_de
    

http://tinyurl.com/jvfn4tz (with the different levels of super-classes)

Monitor the creation of new Wikipedia articles resulting in Wikidata items

When editors create new Wikipedia articles on heritage institutions, the corresponding Wikidata items will not show up in our SPARQL queries unless someone actually adds an "instance of" statement on Wikidata, which is often not the case; Wikipedians will often just add Wikipedia categories. In order to track such newly created articles, a bot can be configured to list them on a special page on Wikipedia.
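As a complementary check on the Wikidata side, items that already carry Swiss country information but no "instance of" statement at all can be surfaced with a query along the following lines. This is a minimal sketch; restricting on country (P17) Switzerland (Q39) is an assumption that casts a wide net and will also return many non-GLAM items:

    SELECT ?item ?itemLabel
    WHERE {
      ?item wdt:P17 wd:Q39 .                        # country: Switzerland
      FILTER NOT EXISTS { ?item wdt:P31 ?type . }   # no "instance of" statement at all
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de,fr". }
    }
    LIMIT 100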

Introduce a unique identifier in both datasets in order to map existing data on Wikidata to the source file

  • Blog Post about how to merge two datasets in OpenRefine
  • Reconcile CSV Tool. Note: The Reconcile tool omits duplicate entries ("doublettes") in the data; these result from generating datasets from SPARQL queries whenever a field has more than one value (e.g. instance of "building" and instance of "museum", or institutions situated in several municipalities/cantons). It may therefore be worthwhile to remove such double entries before running the Reconcile tool, or to collapse them at the query level, as in the sketch below.
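Alternatively, such duplicate rows can be collapsed at the query level before export; a minimal sketch, following the group_concat pattern of the queries above (variable names are illustrative):

    SELECT ?item
           (group_concat(distinct ?typeLabel; separator="; ") as ?types)   # collapse multiple "instance of" values into one field
           (sample(?municipality) as ?oneMunicipality)                     # or keep just one arbitrary value
    WHERE {
      ?item wdt:P17 wd:Q39 .
      ?item wdt:P31 ?type .
      ?type rdfs:label ?typeLabel . FILTER (lang(?typeLabel) = "en")
      OPTIONAL { ?item wdt:P131 ?municipality . }
    }
    group by ?item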

Analyse special cases and identify data modelling issues

Initial notes (just a beginning, nothing systematic...)

Systematic analysis of twelve items from the Wikidata query (items 1, 11, 21, etc.) out of the 230 items, listed in order of their Q-number, i.e. in the order of their creation (the last item being mountain guide museum (Q20853359): Bergführermuseum, St. Niklaus):

  • Cantonal museum of Zoology, Lausanne (Q14630553): Official name of the institution (according to its website): "Musée de zoologie Lausanne" (logo) or "Musée cantonal de zoologie" (text). German label shortened to "Zoologiemuseum" (should the label be translated into German?); French label starting with a lower-case letter, although it's a proper name. The French description repeats the label, this time starting with a capital letter. The Commons category points to the building instead of pointing to the correct Commons category for the institution. The entry contains an inception date, citing WP-fr as a source; for the statement in WP-fr, an external source is cited, but this source (a cached web page) does not contain the wanted information (the actual information is on a sub-page which doesn't seem to be cached). The different name authority files list the following labels: Musée de zoologie (Lausanne, Suisse) (BnF, SUDOC); Musée de zoologie (Lausanne) (GND, citing Helveticat); Musée cantonal de zoologie (Lausanne, Switzerland) (LoC). The Wikidata label seems to follow some Wikipedia lemma conventions rather than the name conventions of the authority files.
  • ABB Switzerland, archive collection (Q19362289): All entries seem to be OK. The description does not identify it as an "archive", only as a "cultural property of national significance". Uses "location" (instead of P131 "located in the administrative territorial entity") to point to the municipality.
  • Schweizerische Theatersammlung (Q19362509): Again, the description doesn't mention the type of institution (archive, museum, library and documentation centre), but only the fact that it is a "cultural property of national significance". The English label gives the institution's name in German. It has the wrong "instance of" statement: "German Federal Archives". Uses "location" (instead of P131 "located in the administrative territorial entity") to point to the municipality.
  • Swiss National Library (Q201787): Has two PCP reference numbers (both A category; this is an oddity of the PCP inventory, as it doesn't even give different labels). Has two VIAF reference numbers (one under the present and one under the former official name). Contains a "GLAM-ID" without citing a reference. The PCP reference numbers are not referenced either. The HDS ID entry mentions the former official name in German (Schweizerische Landesbibliothek), although the current version of HDS gives the current official name (Schweizerische Nationalbibliothek). How should 'former names' be modelled in Wikidata? Uses "location" (instead of P131 "located in the administrative territorial entity") to point to the municipality.
  • Cantonal library of Appenzell Ausserrhoden (Q539405): The description in English ("library") is rather short. Uses "location" and "located on terrain feature" (instead of P131 "located in the administrative territorial entity") to point to the municipality.
  • Swiss Museum of Transport (Q670595): The name of the entry and the English label seem wrong. The official name of the institution in English would be "Swiss Museum of Transport" (according to the institution's website, which also corresponds to the lemma of the English WP article). The German label contains a shortened version of the official name. The French label is not capitalized, although it refers to a proper name. The descriptions (where they exist) are minimal.
  • open-air museum Ballenberg (Q680419): The French label doesn't correspond to the official name, but to a description of the museum.
  • Migros Museum of Contemporary Art (Q1380528): Partly missing labels and descriptions; otherwise ok.
  • canton Thurgau library (collection) (Q1440938): Gives the PCP reference number "5017" which is most likely the one for the building. Doesn't give the two PCP reference numbers for the collection: 9346 (A object), 8932 (B object). Rightly states that it is a Swiss cultural property of national significance and a Swiss cultural property of regional significance (according to the PCP inventory).
  • Ausstellungsraum Klingental (Q780609): Ausstellungsraum Klingental doesn't seem to be a heritage institution (no collection of its own); it is nevertheless described as a "museum".
  • Q1053580: Doesn't seem to be a heritage institution (no collection of its own). It hosts the headquarters of a natural park (Naturpark Beverin). Should probably not have the statement "instance of": "museum".
  • Prehistorical museum, Zug (Q1299354): Partly missing labels and descriptions; otherwise ok.

Further issues encountered when systematically tidying up already existing data on Wikidata after data ingestion:

  • To what extent should parentheses in labels be used for disambiguation purposes? (cf. Remark on this User Talk page)
  • described at URL (P973): This is an interesting property which I haven't actively used so far. Example: Museum of Horsepower (Basel Historical Museum) (Q27837185)
  • MASI Lugano came into existence through the merger of two museums [2]. This still needs to be added and properly modeled.
  • There are numerous instances where the categorization of an institution as "museum" is debatable. Cf. the ICOM definition of "museum": "A museum is a non-profit, permanent institution in the service of society and its development, open to the public, which acquires, conserves, researches, communicates and exhibits the tangible and intangible heritage of humanity and its environment for the purposes of education, study and enjoyment." (ICOM, 2007 definition) (See: ICOM Definition, History of the ICOM Definition, background articles about reevaluating the ICOM definition; an interesting resource with regard to ontology development in the museum sphere is also the document Key Concepts of Museology issued by ICOM in various languages.)
    • In general, an inclusive approach was used, especially with regard to public accessibility. Examples: Zivilschutz-Museum (Q206929) (the Zivilschutzmuseum Zürich is only open to the public on a few days per year); Welschdörfli (Q2252185) (to visit the Schutzbauten Welschdörfli in Chur, visitors need to get the key at the local tourist office); Q3329997 (the Musée de la Bière Cardinal can be visited only on prior demand).
    • When it comes to the requirement of having its own collection, practice is not univocal: Creaviva (Q1139387), the children's museum Creaviva, is a children's department within a larger art museum; it does not have its own collection, but it is nevertheless described as an instance of "museum". On the other hand, for Staufferhaus Local Museum (Q2334559) (the Staufferhaus Unterentfelden seems to be used as a local cultural centre, so its status as a museum is dubious) and Archizoom (Q16303827) (Archizoom is not a museum, but a collective providing exhibition space for temporary exhibits, in the given case on a specific topic), the statement instance of "museum" has been removed. There seems to be no proper vocabulary yet on WD to describe the latter type of institution. A similar case would be Stapferhaus Lenzburg (Q2332095) (Stapferhaus Lenzburg).
    • The question whether a museum automatically is also a "memory institution" (in case the requirement of having its own heritage collection is upheld) still needs to be discussed. Alternatively, a separate statement instance of "memory institution" can be used to distinguish museums that have their own collection from the rest. This class has also been used to distinguish libraries with heritage collections from those without.
  • There are also examples that raise questions about how to describe museums that are (being) dissolved: Bahnmuseum Kerzers (Q801775): The Bahnmuseum Kerzers-Kallnach is in the process of dissolution. It is described as an instance of "museum", but there is no indication yet of its "dissolution"; according to some press articles dating from 2014, the museum has to close by 2017. The website has already been deactivated, and it is questionable whether there will be a reliable source indicating the exact date or year of closure. How should we proceed in this case? Cocteau Kabinett (Q1105477): The Cocteau-Kabinett is a museum which closed in 2005. It is described as an instance of "museum", and the year of dissolution is declared. Should the description be changed to "former museum in ..."? It should also be kept in mind that queries for present-day museums need to account for such special cases (see the query sketch after this list). The same goes for institutions that used to be located in a given country, but have since moved to a different country. Example: Hector Hodler Library (Q620999).
  • Some items refer to "collections" within an institution. If, for data modelling purposes, it appears useful to distinguish between the institution and (one of) its collection(s), a separate item should be created for the collection. Question: Should a particular sub-class of collections, e.g. "heritage collection", be created for use in the context of heritage institutions? Examples: historical collection of the Museum of the Canton of Aargau (Q19362343): The historical collection of the Museum of the Canton of Aargau figures in the directory of protected cultural properties in Switzerland, where its location is indicated to be the Lenzburg Castle; the museum it belongs to, however, comprises several castles, its seat being at the Wildegg Castle. Fondation Egon von Vietinghoff (Q3075437): The item refers to a private art collection. manuscript and music collection of the Basel University Library (Q19362826). Bibliothèque des Cèdres (Q2901390) (bibliothèque des Cèdres in Lausanne): This is a tricky one: it used to be an independent library (under a different name) and was later integrated into BCUL Lausanne, where it was located in a building on Cèdres street, from which it got its present name. The building was later repurposed and the collection brought to a storage room. At present the item is described as a "collection". Obviously, one would need to think about how to model the past events, the re-naming, etc. See also: http://www.vd.ch/fileadmin/user_upload/organisation/gc/fichiers_pdf/10_QUE_014_11_QUE_017_Texte_CE.pdf
  • When to attribute "institution" status to an item and when to attribute "collection" status is not always easy to decide if the item belongs to or is attached to a larger institution. I opted for a pragmatic approach: whenever the "collection" is marketed to the outside as a special archive or a separate museum, I described it as an "institution" that is part of another institution. Examples: Hans Erni Museum (Q3329191): The Hans-Erni-Museum is an art museum which forms part of the Swiss Museum of Transport. Graphische Sammlung der ETH (Q27490235): The Graphische Sammlung of the ETH Library is listed in the museums inventory of the Swiss Museums Association; it has therefore been described as an instance of "museum", an instance of "collection", and as part of "ETH Library". Centre de documentation et d'étude sur la langue internationale (Q2296824): The Centre de documentation et d'étude sur la langue internationale is integrated into the La Chaux-de-Fonds municipal library.
  • historical workshop Mulin (Q1620920): The museum is located in a municipality (Schnaus) that has merged into another one (Ilanz/Glion). How should this be modelled? Do we want to preserve the information regarding the former municipality, or should we just replace the entry? Keeping the information, e.g. by adding an end qualifier to the administrative territorial entity, allows for richer information, but it will also make querying the data more complex (see the query sketch below).
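Picking up the last two points: a query for present-day museums should exclude dissolved institutions and, where end qualifiers are used, consider only current P131 statements. The following is a minimal sketch, under the assumption that dissolution is recorded with P576 ("dissolved, abolished or demolished date") and the end of validity of a P131 statement with the qualifier P582 ("end time"):

    SELECT ?item ?itemLabel ?municipalityLabel
    WHERE {
      ?item wdt:P31/wdt:P279* wd:Q33506 .                   # museums, including sub-classes
      ?item wdt:P17 wd:Q39 .                                # country: Switzerland
      FILTER NOT EXISTS { ?item wdt:P576 ?dissolved . }     # exclude dissolved institutions
      ?item p:P131 ?statement .
      ?statement ps:P131 ?municipality .
      FILTER NOT EXISTS { ?statement pq:P582 ?endTime . }   # only current municipality statements
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de,fr". }
    }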

Overview of the main data modelling issues

The overview below covers the most common data modelling issues that were encountered and how they can be resolved.

Issue: Confusion between building and institution
Description: The same WD item is used to describe both a building and an institution/collection.
Example: Weierbachhaus
Approach chosen: Create two separate items for the building and the institution, and reference the building as a "location" of the institution. Make sure that the various identifiers and authority references are attached to the right item. No separate items are created for institutions and their collections unless this is deemed necessary due to particular circumstances.
Examples: Ortsmuseum Eglisau (Q27891193) / Weierbach House (Q20012710)

Issue: Differing practices regarding the labels
Description: No common practice is followed across items and languages regarding the label of an item.
Approach chosen: Use the following format for labels: [Official Name] OR [Translated Name] ([Municipality], [Country]). Municipality and country are indicated if they are not already part of the official name and if the official name is generic enough to allow for confusion with other entities. The official name may be kept in the original language or translated into the language of the label (if an official translation exists, the official translation should be used), whereas the municipality and the country are always indicated in the language of the label. Proper names are capitalized according to the rules that apply in the different languages.
Examples: Basel Historical Museum (Q386286) / Monuments Preservation Service of the Canton of Zurich (Q27480035)

Issue: Differing practices regarding the descriptions
Description: No common practice is followed across items and languages regarding the description of an item.
Approach chosen: Be as specific as possible regarding the institution type. Indicate the name of the municipality, with the country name in brackets. The primary focus should be on the institution type, not on whether the heritage collection is a cultural property of national or regional significance. If the institution has ceased to exist, write "former museum in ...", etc.
Examples: Cocteau Kabinett (Q1105477)

Issue: A wrong property is used to refer to the municipality
Description: Properties other than located in the administrative territorial entity (P131) are used to refer to the municipality.
Example: Cantonal Library of Appenzell Ausserrhoden
Approach chosen: Exclusively use located in the administrative territorial entity (P131) to refer to municipalities.
Examples: Cantonal library of Appenzell Ausserrhoden (Q539405)
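Items affected by the last issue can be tracked down with a query along the following lines; this is a minimal sketch, assuming that a "location" (P276) statement pointing at a municipality of Switzerland (Q70208) is a reliable signal that P131 was intended:

    SELECT ?item ?itemLabel ?municipalityLabel
    WHERE {
      ?item wdt:P276 ?municipality .          # "location" used where P131 was presumably intended
      ?municipality wdt:P31 wd:Q70208 .       # target is a municipality of Switzerland
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de,fr". }
    }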

Get / improve thesauri for the values to be entered on Wikidata

Inspect the thesauri that already exist on Wikidata (municipalities, cantons, types of heritage institutions)

Upon first inspection it appears that there is quite good coverage of Swiss municipalities and Swiss cantons (with unique identifiers): See Wikidata SPARQL queries above. If labels are to be used in tables, the labels of the cantons should be harmonized in some of the languages.
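Such label discrepancies can be spotted by listing the canton labels side by side per language; a minimal sketch, following the OPTIONAL/FILTER pattern of the queries above and assuming the cantons are instances of canton of Switzerland (Q23058):

    SELECT ?canton ?Label_de ?Label_fr ?Label_it ?Label_en
    WHERE {
      ?canton wdt:P31 wd:Q23058 .             # canton of Switzerland
      OPTIONAL { ?canton rdfs:label ?Label_de . FILTER (lang(?Label_de) = "de") }
      OPTIONAL { ?canton rdfs:label ?Label_fr . FILTER (lang(?Label_fr) = "fr") }
      OPTIONAL { ?canton rdfs:label ?Label_it . FILTER (lang(?Label_it) = "it") }
      OPTIONAL { ?canton rdfs:label ?Label_en . FILTER (lang(?Label_en) = "en") }
    }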

Regarding the typology of heritage institutions, there seems to be a lot of clean-up work to be done, as the existing data is quite inconsistent (see the Wikidata SPARQL query above). At the same time, there are already many existing items on which to build a thesaurus.

Inspect thesauri that exist outside of Wikidata (types of heritage institutions)

Create new types where needed. See: Typology of heritage institutions

Analyze the present usage of the thesauri in the context of Swiss heritage institutions

In ca. 60% of cases, the basic type (museum, archive, library) is used in the statement "instance of", sometimes in combination with another class from outside the GLAM realm (e.g. church).

In ca. 5% of cases, the basic type (museum, archive, library) is used in combination with one of its sub-classes (e.g. museum & art museum; library & special library).

In ca. 35% of cases, one of the sub-classes is used, without indicating the basic type.

In a couple of cases, the type "college library" is used to refer to a university library.
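The third pattern (a sub-class without the basic type) can be listed with a query of the following kind, shown here for museums; this is a minimal sketch, and the analogous queries for archives (Q166118) and libraries (Q7075) simply swap the class:

    SELECT DISTINCT ?item ?itemLabel ?typeLabel
    WHERE {
      ?item wdt:P17 wd:Q39 .                           # country: Switzerland
      ?item wdt:P31 ?type .
      ?type wdt:P279+ wd:Q33506 .                      # typed with a proper sub-class of museum ...
      FILTER NOT EXISTS { ?item wdt:P31 wd:Q33506 }    # ... but not with the basic type itself
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de,fr". }
    }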

Model the data source(s) on Wikidata

Create an item for the data source on datahub.io (see: Help:Sources): Swiss GLAM Inventory (Q26933296)     

Add an item for the specific release of the dataset: Swiss GLAM Inventory, 16 September 2016 (Q27477970) (due to limitations of the Quick Statements Tool, see below).

Clean up existing data on Wikidata

Largely done. See notes above regarding data modelling issues.

Add new statements to Wikidata

Create new items

Add new items to Wikidata using the Quick Statements Tool.

Code:

CREATE
LAST	Lde	"Museum und Römerhaus Augusta Raurica"
LAST	Dde	"Archäologisches Museum in Augst (Schweiz)"
LAST	P3066	"CH-000078"
LAST	P31	Q33506	S248	Q26933296	P577	+2016-06-19T00:00:00Z/11
LAST	P31	Q3329412	S248	Q26933296	P577	+2016-06-19T00:00:00Z/11
LAST	P31	Q1497649	S248	Q26933296	P577	+2016-06-19T00:00:00Z/11
LAST	P969	"Giebenacherstrasse 17"	S248	Q26933296	P577	+2016-06-19T00:00:00Z/11	
LAST	P281	"4302"	S248	Q26933296	P577	+2016-06-19T00:00:00Z/11
LAST	P856	"http://www.augusta-raurica.ch"	
LAST	P381	"8469"	S248	Q26933296	P577	+2016-06-19T00:00:00Z/11
LAST	P1435	Q8274529	S248	Q26933296	P577	+2016-06-19T00:00:00Z/11
LAST	P131	Q66515	S248	Q26933296	P577	+2016-06-19T00:00:00Z/11
LAST	P131	Q12146	S248	Q26933296	P577	+2016-06-19T00:00:00Z/11
LAST	P17	Q39	S248	Q26933296	P577	+2016-06-19T00:00:00Z/11

Example item:

Issues:

  • Issue 1: The corresponding item (Roman museum (Q2179606)) already exists. Somehow it wasn't identified during the matching process, probably due to an erroneous SPARQL query that left out the sub-classes of museums, libraries, and archives. Corrective action: re-run the matching process (e.g. also on PCP identifiers) / merge the two items.
  • Issue 2: The publication date of the reference is not properly rendered. The functionalities of the Quick Statements Tool are still limited and should be further extended in an overhaul. See the post on Magnus' talk page and the discussion on the Wikidata mailing list. Workaround: Create a different WD item for each new release of the dataset.

Updated Code:

CREATE
LAST	Lde	"Historische Militäranlagen Freiburg/Bern"
LAST	Dde	"historisches Museum in Aarberg (Schweiz)"
LAST	Lfr	""
LAST	Lit	""
LAST	Len	""
LAST	Den	"historical museum in Aarberg (Switzerland)"
LAST	P3066	"CH-000010"	S248	Q27477970	
LAST	P31	Q16735822
LAST	P31	Q1497649	S248	Q27477970	
LAST	P969	"mehrere Standorte in den Kantonen Bern und Freiburg"	S248	Q27477970	
LAST	P281	"3270"	S248	Q27477970	
LAST	P856	"http://www.fort-fribe.ch/"
LAST	P381	""	S248	Q27477970	
LAST	P1435		S248	Q27477970	
LAST	P131	Q64113	S248	Q27477970	
LAST	P131	Q11911	S248	Q27477970	
LAST	P17	Q39	S248	Q27477970

Example items:

Issues:

  • Issue 3: The Quick Statements Tool processed the data for about 25 items (348 statements), after that it came to a halt. Reason: the item it processed was missing a statement for P31 (empty field in the database). Interestingly, empty fields for P1435 don't bother the tool and are skipped. Empty strings ("") are also treated as null and are simply skipped by the tool. Corrective action: Replace empty P31 fields in the database by "".


Updated Code:

CREATE
LAST	Lde	"Historisches Museum Uri"
LAST	Dde	"Museum in Altdorf (Schweiz)"
LAST	Lfr	""
LAST	Lit	""
LAST	Len	""
LAST	Den	"museum in Altdorf (Switzerland)"
LAST	P3066	"CH-000033"	S248	Q27477970
LAST	P31	Q1595639
LAST	P31	""
LAST	P31	""
LAST	P31	""
LAST	P31	Q1497649	S248	Q27477970	
LAST	P969	"Gotthardstrasse 18"	S248	Q27477970	
LAST	P281	"6460"	S248	Q27477970	
LAST	P856	"http://www.hvu.ch"
LAST	P381	"8579"	S248	Q27477970	
LAST	P1435	Q8274529	S248	Q27477970	
LAST	P131	Q68150	S248	Q27477970	
LAST	P131	Q12404	S248	Q27477970	
LAST	P17	Q39	S248	Q27477970

Example items:

Issues:

  • Issue 4: The Quick Statements Tool cannot be reached. Action: Wait and try again later (the tool was down for some time, but worked again a few hours later).

Observations:

  • Ingestion of ca. 50 items (ca. 1000 triples) takes ca. 15 minutes (i.e. the tool is writing a bit more than one line per second).
  • In order to generate the lines to be entered into the tool, the mail merge ("Mailings") functionality of Microsoft Word was used, which allows creating a template and connecting its elements to columns in an Excel spreadsheet. Once the resulting document is generated, the lines can be copied and pasted into the Quick Statements Tool. It makes sense to start with 1 or 2 items first in order to debug the template; when everything is ok, it is possible to copy-paste the data in batches of several thousand lines at a time into the tool.
  • The Internet connection may be interrupted (e.g. when sending the computer into sleep mode). This will halt the ingestion process; upon reconnection to the Internet, the ingestion process resumes where it had come to a halt.
  • When the ingestion process is done (message: "All done"), it is worthwhile doing a quick search for the string "ERROR" in the ingestion log. It may help you spot potential double entries (items with the same label) or badly formed data (e.g. URLs with a missing "http://"). In the case of double entries, the double entry is actually created; you will then have to check whether the items in question should be merged. Badly formed URLs are simply skipped by the tool. This behavior of the Quick Statements Tool was observed by chance while executing the job; it would be useful to document all the checks and the ensuing actions in a systematic manner in the future.

Complement existing items

Complement existing items on Wikidata using the Quick Statements Tool.

Code:

Q301235	Lde	"Aargauer Kantonsbibliothek"
LAST	Dde	"Kantonsbibliothek in Aarau (Schweiz)"
LAST	Lfr	""
LAST	Lit	""
LAST	Len	""
LAST	Den	"cantonal library in Aarau (Switzerland)"
LAST	P3066	"CH-000002"	S248	Q27477970
LAST	P31	Q678405
LAST	P31	""
LAST	P31	""
LAST	P31	""
LAST	P31	Q1497649	S248	Q27477970	
LAST	P969	""	S248	Q27477970	
LAST	P281	""	S248	Q27477970	
LAST	P856	""
LAST	P381	"8924"	S248	Q27477970	
LAST	P1435	Q8274529	S248	Q27477970	
LAST	P131	Q14274	S248	Q27477970	
LAST	P131	Q11972	S248	Q27477970	
LAST	P17	Q39	S248	Q27477970

Example: before / after

Observations:

  • When adding statements to existing items,
    • labels are overwritten (empty labels as well!). Therefore, remove any empty label declarations.
    • descriptions are overwritten (empty descriptions as well!). Therefore, remove any empty description declarations.
    • aliases: ??? (needs to be checked; the data didn't contain any aliases)
    • empty statements, such as LAST P856 "", didn't overwrite existing values.
    • sources for existing statements are added (there is no unwanted duplication of the statement itself). However, the reference was not added to the PCP reference number (8924). Why?
  • This time, the tool writes only about one triple (one line) per 2 seconds (500 triples in 15 minutes).

Issues:

Visualize the data using SPARQL queries in order to inspect the data (quality check)

(See the SPARQL queries above.)

Visualization using the Listeria Tool:

Note: These lists seem to also include items which do not have a statement "instance of" museum, library, or archive, which is unexpected. Examples:

I would consider this a bug; at the least, the functionality does not seem to be properly documented.
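Independently of Listeria, the typing of the ingested items can be cross-checked directly in SPARQL; a minimal sketch, assuming that the GLAM Identifier (P3066) statements added during the ingestion reliably mark the items stemming from the inventory:

    SELECT ?item ?itemLabel
    WHERE {
      ?item wdt:P3066 ?glamId .                 # items carrying a GLAM identifier from the inventory
      FILTER NOT EXISTS {                       # ... that are typed with none of the three root classes
        VALUES ?root { wd:Q33506 wd:Q166118 wd:Q7075 }
        ?item wdt:P31/wdt:P279* ?root .
      }
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de,fr". }
    }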