Wikidata:WikiProject Cultural heritage/Reports/Ingesting the Swiss Inventory of Cultural Properties

Author: Affom (talk) 15:17, 12 May 2017 (UTC)

Introduction to the case report

Background

This report describes the process of ingesting data from the Swiss Inventory of Cultural Properties (built heritage and heritage collections). It is the first in a series of case studies. As this is the author's very first attempt to ingest data into Wikidata, the entire process is very hands-on. Some of the tools (e.g. OpenRefine, Quick Statements) and techniques (SPARQL, RDF) required during the process had never been used by the author before. The underlying idea is that the procedure will be improved step by step as the series of case studies is expanded.

Data Sources

The present case study refers to the ingestion of data from the Swiss Inventory of Cultural Property of National and Regional Significance (KGS).

Procedure

The procedure followed contains nine steps:

  • Step 1: Get a thorough understanding of the data (class diagram: what are the classes? what are the properties? what are the underlying definitions?)
  • Step 2: Analyze to what extent Wikidata already has the required data structures (relevant classes and properties already defined?). If necessary, create the relevant classes and properties (community discussion required). Create a mapping between the source data and the target data structure on Wikidata.
  • Step 3: Check whether data formats in the source data correspond to acceptable formats on Wikidata. If not, some data cleansing may be required.
  • Step 4: Check whether there is an explicit or implicit unique identifier for the data entries. – If not, think about creating one (how best to go about this needs clarification).
  • Step 5: Check whether a part of the data already exists on Wikidata. If yes, create a mapping in order to add data to already existing items and to avoid creating double entries in Wikidata.
  • Step 6: Model the data source in Wikidata (how best to go about this needs clarification).
  • Step 7: Clean up existing data on Wikidata.
  • Step 8: Ingest the data, providing the source for each statement added (avoid overwriting existing data). Log conflicting values. At the beginning, proceed slowly, step by step; check which aspects of the upload process can be automated using existing tools and which need to be dealt with manually. Ask community members to review the first items ingested before batch ingesting.
  • Step 9: Visualize the data using the Listeria tool or SPARQL queries in order to inspect the data (quality check).

This procedure was introduced by Beat Estermann in his report on ingesting data about Swiss heritage institutions.

Field notes

Step 1


I found that the first step of the proposed procedure is strongly linked to the second step. As the final product of the second step is a comparison between the properties of an item in the original data source and their Wikidata counterparts, a separate analysis of the data source alone is not very useful. I nevertheless created a class diagram to provide an overview of the data I was working with.

Step 2

As for step 2, the following mapping was created:

Property in datafile | Property in Wikidata | Refers to class in WD / possible values | Remarks
KGS Nummer | PCP reference number (P381) | – | Wikidata property for cultural heritage identification
KGS Kategorie | heritage designation (P1435) | class A: Swiss cultural property of national significance (Q8274529); class B: Swiss cultural property of regional significance (Q12126757) | –
Gemeinde | located in the administrative territorial entity (P131) | e.g. Aarau (Q14274) | –
Kanton | located in the administrative territorial entity (P131) | e.g. Aargau (Q23058) | –
Koordinaten | coordinate location (P625) | e.g. 27°59'17"N, 86°55'31"E | The coordinates in the original data source are given in a Swiss format (CH1903). They need to be converted to the international WGS84 format; furthermore, coordinates are ingested in decimal notation.
Objekt Art 1 | instance of (P31) | – | The property describes the category name provided by the Swiss authorities. There are only four object categories: Bau (architectural structure); Archäologie (archaeological site); Sammlung (collection/memory institution); Bau/Archäologie (combination of architectural structure and archaeological site).
Adresse | located at street address (P969) | e.g. Bahnhofstrasse 91, 5000 Aarau | The property description states: "full street address that the subject is located at. Include building through to post code." However, most addresses already in Wikidata lack the post code; sometimes the post code is stored in a different property. This practice should be reviewed.
Hausnummer | located at street address (P969) | same as Adresse | same as Adresse
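
For later scripting steps, it can be handy to have this mapping in machine-readable form. The following Python sketch merely mirrors the table above; the dictionary name and layout are illustrative, and only the property and item identifiers come from the mapping.

# Source-to-Wikidata property mapping, mirroring the table above.
# The dictionary name and layout are illustrative.
KGS_TO_WIKIDATA = {
    "KGS Nummer":    {"property": "P381"},    # PCP reference number
    "KGS Kategorie": {"property": "P1435",    # heritage designation
                      "values": {"A": "Q8274529",     # national significance
                                 "B": "Q12126757"}},  # regional significance
    "Gemeinde":      {"property": "P131"},    # located in the administrative territorial entity
    "Kanton":        {"property": "P131"},
    "Koordinaten":   {"property": "P625"},    # coordinate location (CH1903 -> WGS84 needed)
    "Objekt Art 1":  {"property": "P31"},     # instance of
    "Adresse":       {"property": "P969"},    # located at street address
    "Hausnummer":    {"property": "P969"},
}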

Creation of a new property

Some of the objects listed in the catalog already existed in Wikidata (e.g. Q444938). Often, the label in the catalog, provided by the Swiss Federal Office for Civil Protection (in this case: "Altreu, mittelalterliche Stadtwüstung"), does not correspond to the label in Wikidata. In some cases both labels are relevant, which means that the label originally provided in Wikidata should not simply be overwritten. A new property is therefore needed to record the name used in the catalog.

The new property has to be proposed first: not every user should be able to create new properties, and they should first be discussed by the community. New properties can be proposed in the Property proposal section.

Step 3

In step 3, I started to work on the data source, which consisted of two spreadsheets from the Swiss Inventory of Cultural Property of National and Regional Significance (KGS). During this step I completed several tasks:

Merging the two datasets and introducing a new variable (A or B object)

The data I received consisted of two lists of objects: “Swiss cultural heritage objects of national significance” (A objects) and “Swiss cultural heritage objects of regional significance” (B objects).

This step could easily be done with the help of Excel or any other simple spreadsheet tool.
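
As an aside, the same merge could also be scripted. A minimal sketch with pandas is shown below; the file and column names are assumptions of this sketch, not part of the original workflow.

import pandas as pd

# Hypothetical file names; the originals were the two KGS spreadsheets
# (A objects and B objects).
a_objects = pd.read_excel("kgs_a_objects.xlsx")
b_objects = pd.read_excel("kgs_b_objects.xlsx")

# Record the significance class in a new variable before combining the lists.
a_objects["KGS Kategorie"] = "A"
b_objects["KGS Kategorie"] = "B"

merged = pd.concat([a_objects, b_objects], ignore_index=True)
merged.to_excel("kgs_merged.xlsx", index=False)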

Minor Data Cleansing

For this task, OpenRefine was used. The tool's facet and clustering functions allow for powerful data cleansing: obvious errors such as unnecessary blank spaces or spelling mistakes can be detected and easily corrected.

Transformation of the coordinate location from the Swiss format CH1903 to the international format WGS84

The coordinate location in the data source was provided in a specific Swiss format called CH1903. To convert it to the WGS84 format used by Wikidata, the algorithm provided on [1] was used; the algorithm was implemented in Microsoft Excel.
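
For illustration, the swisstopo approximation formulas can also be expressed as a short Python function. The sketch below is not the Excel implementation actually used, and the function name and example values are arbitrary.

def ch1903_to_wgs84(east, north):
    """Approximate conversion from Swiss CH1903 (LV03) coordinates to WGS84.

    Based on the published swisstopo approximation formulas; accurate to
    roughly a metre, which is sufficient for coordinate locations on Wikidata.
    """
    # Auxiliary values used by the approximation formulas.
    y = (east - 600000) / 1000000.0
    x = (north - 200000) / 1000000.0

    lon = (2.6779094
           + 4.728982 * y
           + 0.791484 * y * x
           + 0.1306 * y * x ** 2
           - 0.0436 * y ** 3)
    lat = (16.9023892
           + 3.238272 * x
           - 0.270978 * y ** 2
           - 0.002528 * x ** 2
           - 0.0447 * y ** 2 * x
           - 0.0140 * x ** 3)

    # The formulas yield values in units of 10000 seconds of arc; convert to degrees.
    return lat * 100 / 36, lon * 100 / 36

# Illustrative call with CH1903 coordinates roughly in the Aarau area.
print(ch1903_to_wgs84(645000, 249000))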

Step 4

Each entry in the dataset has a “KGS number”, which corresponds to the inventory number assigned by the Federal Office for Civil Protection (FOCP). On Wikidata, it is covered by the property PCP reference number (P381). Matching with existing data was done using this unique identifier.

Step 5

To extract the data from Wikidata, SPARQL queries had to be created and run against the Wikidata SPARQL endpoint. Creating the right queries turned out to be rather difficult. The query used returns every item that is either a class A Swiss cultural property of national significance (Q8274529) or a class B Swiss cultural property of regional significance (Q12126757).
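
The exact query is not reproduced here. A minimal sketch of a query along these lines, run from Python against the Wikidata Query Service, might look as follows; the requests-based wrapper is an illustration, not the setup actually used.

import requests

# Minimal sketch: fetch all items whose heritage designation (P1435) is either
# Q8274529 (class A, national significance) or Q12126757 (class B, regional
# significance), together with their PCP reference number (P381), if any.
QUERY = """
SELECT ?item ?itemLabel ?pcp WHERE {
  VALUES ?designation { wd:Q8274529 wd:Q12126757 }
  ?item wdt:P1435 ?designation .
  OPTIONAL { ?item wdt:P381 ?pcp . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de,fr,it". }
}
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
)
for row in response.json()["results"]["bindings"]:
    print(row["item"]["value"], row.get("pcp", {}).get("value", ""))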

It turned out that keeping the query as short as possible reduced the chance of getting duplicates. However, I still encountered a few issues, including duplicates. The issues are described in the table below:

No | Problem | How to find out | Possible reasons | Solution
1 | 278 out of 3296 items (Wikidata Q numbers) are duplicates | 1. Colour the duplicates with Excel conditional formatting; 2. filter all the coloured cells (a scripted alternative is sketched below the table) | While it is possible that an object has two PCP numbers, there should never be several objects with the same PCP number; the second sublist (problem 2) is therefore the more relevant one. Either way, each case has to be identified. | Remove real duplicate entries on Wikidata; ignore duplicates that merely resulted from the query.
2 | 312 out of 3296 items share the same PCP number (most of the time these are the same items as in problem 1, but not always) | same as above | same as above | same as above
3 | No PCP number (1 occurrence) | – | – | Insert it in Wikidata.
4 | No PCP number, but not a double entry (134 occurrences) | Sort in Excel | – | Try to match the items anyway using other identifiers such as the label. As there are many entries, doing this manually might take quite a long time; the reconcile-csv tool might be useful.
5 | Altstadt Rapperswil | – | The object is neither in the original dataset (no specific object of national or regional significance) nor worth creating a Wikidata item for. | Delete!
6 | http://www.wikidata.org/entity/Q1979552 | – | – | Cannot be matched.
7 | http://www.wikidata.org/entity/Q15130295 | – | – | Cannot be matched.
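
As referenced in row 1 of the table, the duplicate checks done with Excel conditional formatting can also be scripted. The following pandas sketch makes the two cases explicit; the file and column names are assumptions.

import pandas as pd

# Hypothetical column names: "qid" holds the Wikidata Q number returned by the
# query, "pcp" the PCP reference number (P381).
results = pd.read_csv("query_results.csv")

# Rows in which the same Wikidata item appears more than once (often a mere
# artefact of the query, e.g. an item with two PCP numbers).
duplicate_items = results[results.duplicated(subset="qid", keep=False)]

# Rows in which several different items share one PCP number; these point to
# real double entries on Wikidata that should be merged.
with_pcp = results[results["pcp"].notna()]
duplicate_pcp = with_pcp[with_pcp.duplicated(subset="pcp", keep=False)]

print(len(duplicate_items), "rows with a repeated Q number")
print(len(duplicate_pcp), "rows sharing a PCP number")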

During step 5, the mapping of the municipalities and cantons was done as well. Each municipality and canton was mapped to its corresponding Wikidata item (all of which already existed in Wikidata). This task was done using OpenRefine together with reconcile-csv, an add-on that allows fuzzy matching.
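
Where no reconciliation tool is at hand, a rough fuzzy match can also be scripted. The sketch below uses Python's standard difflib module and is only an approximation of what reconcile-csv does; the input data is hypothetical (only the Q numbers for Aarau and Aargau come from the mapping table above).

import difflib

# Hypothetical inputs: municipality/canton names from the KGS file and a lookup
# of Wikidata labels to Q numbers (e.g. obtained via a SPARQL query).
# "Arau" is a deliberate misspelling to illustrate the fuzzy matching.
kgs_names = ["Arau", "Aargau"]
wikidata_items = {"Aarau": "Q14274", "Aargau": "Q23058"}

for name in kgs_names:
    # Return the closest Wikidata label above a similarity cutoff, if any.
    matches = difflib.get_close_matches(name, list(wikidata_items), n=1, cutoff=0.8)
    if matches:
        print(name, "->", matches[0], wikidata_items[matches[0]])
    else:
        print(name, "-> no match, check manually")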

Step 6

Each statement that is ingested into Wikidata should be accompanied by a statement about the source of the data. This source can be added as a simple string (e.g. a web link) or as a Wikidata item itself. For this case study, the latter option was chosen.

The source of the data used in the case study is the KGS inventory. Because this inventory is updated every year, a specific item for the given release of the inventory was created. Since this item already existed for the A-classified objects, only the item for the B-classified objects had to be created.

Step 7

Several existing data entries in Wikidata had already been cleaned up during the previous steps. However, there still seemed to be a lot of work to do. The SPARQL query returns a list of Wikidata items with a KGS number. After looking through the list, 61 double entries were discovered. These double entries were later merged using QuickStatements. The code that was put into the tool was generated in the same way as described in step 8, using the Office Mail Merge function.

Some of the double entries were special cases, however, and had to be dealt with on an individual basis; two examples are described in the table below:

Examples | Description
Kloster St. Andreas (Sarnen) vs. Sammlungen des Frauenklosters St. Andreas | The first item is meant to be the building, a monastery; the second one is the memory collection of the monastery. The KGS inventory specifically refers to the collection, so the KGS number on the first item is wrong.
Museum für Kommunikation / Musée de la communication vs. Sammlung des Museums für Kommunikation | Roughly the same case as above, this time for a museum.

Step 8

The ingestion of the data was done using the Quick Statements tool. I proceeded step by step, ingesting one data entry at a time at the beginning and larger batches towards the end. In this way, the code used for the Quick Statements tool was gradually debugged and improved.

Creation of the Code for Quick Statements using Mail Merge

The code was created using the Mail Merge function of Microsoft Word. After many trial-and-error sessions I finally managed to create code that could be copy-pasted into the Quick Statements tool.
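
The same kind of code could also be generated with a small script instead of Mail Merge. Purely as an illustration, the following Python sketch reads a cleaned spreadsheet and prints Quick Statements commands; the column names are assumptions, while the property IDs and the source item Q28962694 follow the examples further below.

import pandas as pd

# Hypothetical column names in the cleaned spreadsheet; the property mapping
# follows the table in step 2, and Q28962694 is the inventory item used as
# source (S248) in the examples further below.
rows = pd.read_csv("kgs_cleaned.csv")
SOURCE = "Q28962694"

for _, row in rows.iterrows():
    print("CREATE")
    print(f'LAST\tLde\t"{row["Objekt"]}"')
    print(f'LAST\tP1435\t{row["Kategorie_QID"]}\tS248\t{SOURCE}')
    print(f'LAST\tP381\t"{row["KGS_Nummer"]}"\tS248\t{SOURCE}')
    print(f'LAST\tP131\t{row["Gemeinde_QID"]}\tS248\t{SOURCE}')
    print(f'LAST\tP625\t@{row["lat"]}/{row["lon"]}\tS248\t{SOURCE}')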

In the following, a few obstacles I encountered are described along with the approach I used to overcome them:

  • Different strings had to be used depending on the type of the item (for example, different Wikidata Q numbers in the instance of (P31) property for each type of item).
    • Use the merge IF statement, which can be found in the options.
  • The IF statements had to be altered.
    • First press “Alt + F9” to see the actual field codes in the mail merge document, then alter them manually.
  • The field codes can only be altered by using special merge braces.
    • Press “CTRL + F9” with the cursor at the place where the braces should be added.
  • The generated text differs from the original data source (e.g. the Excel file).
    • Before even starting the process, the following options have to be activated:
      • File > Options > Advanced
      • Under General, select the Confirm file format conversion on open check box and click OK.
      • While choosing your data file, in the Confirm Data Source dialog box, select the Show all check box and choose MS Excel Worksheets via DDE (*.xls) > OK.

Ingestion and improvement of the Quick Statements code

The following shows two versions of the code, the errors that occurred, and how the code was improved. Sometimes the improvement could be done by the author himself; sometimes the community was asked for assistance.

First version of the code:

CREATE
LAST   Lfr    „Chapelle du Sacré-Coeur“
LAST   P381   „9734“
LAST   P1435  Q8274529
LAST   P131   Q28035756
LAST   P131   Q12640
LAST   P31    Q4989906
LAST   P625   “46°50'50''N, 6°50'53.1''E“
LAST   P969   "Chemin du Sacré-Coeur 2“

Problems and solutions:

  • The coordinates did not appear: coordinates have to be ingested in the decimal format and with a special syntax: @LAT/LON.
  • P381, the PCP number, did not appear: the quotation marks have to be written without any special formatting (plain " rather than „ or “). Press CTRL + Z right after typing them to undo the automatic correction.
  • No reference: put S248 and the source (as a Wikidata item) after every statement.

Updated code:

Q1590954   Lde    Schlossgarten
LAST       P31    Q4989906                  S248   Q28962694
LAST       P131   Q14274                    S248   Q28962694
LAST       P1435  Q8274529                  S248   Q28962694
LAST       P131   Q11972                    S248   Q28962694
LAST       P625   @47.39405188/8.045910345  S248   Q28962694
LAST       P969   "Laurenzvorstadt 3"       S248   Q28962694
LAST       P381   "9388"                    S248   Q28962694

Remaining problems and solutions:

  • The coordinates have too many digits: round them in the original data source, keeping only the first two digits after the comma.
  • Coordinates and addresses appear twice when they were already stated on the original Wikidata item: when there is certainty that the data from the source is correct (e.g. better than the data already in Wikidata), the pre-existing data should be deleted beforehand using the new Quick Statements tool.
  • Wrong class for the objects (monument instead of building): correct the statements and change the value to architectural structure.
  • The statement instance of memory institution may be wrong on some objects: as all Swiss heritage institutions had already been ingested into Wikidata before, the statement did not have to be added at all.

Step 9

The same SPARQL query as in step 5 was run again. The results now contained the newly ingested data, which was looked through in order to check its quality. The Listeria tool was not used in this step.