Wikidata:WikiProject Cultural heritage/Guidelines/Ingesting Datasets into Wikidata

This is a draft version that needs to be reviewed.

Author: Affom (talk) 16:05, 12 May 2017 (UTC)

Introduction

This user guideline aims to show how bigger datasets can be ingested into Wikidata. As the creation of this guideline is embedded in https://www.wikidata.org/wiki/Wikidata:WikiProject_Cultural_heritage, it focuses on cultural heritage related data and performing arts related data. However, the guideline can also be applied to any other kind of data, preferably obtained from existing, bigger datasets. Furthermore, the guideline focuses on technical issues and on help with the tools that were used during the process. If you are rather looking for the process itself, applied to concrete examples, please refer to the reports of the ingestion processes that are linked in the report section of WikiProject Cultural heritage.

Introduction to Wikidata

Before you can follow the instructions of this guideline, you should be well acquainted with the basic concepts of Wikidata. In particular, the concepts of properties and items should be known to you. To get to know Wikidata, this video is very useful.

Procedure

The procedure followed during the ingestion was first introduced by Beat Estermann, who himself adapted it from similar reports. I changed the process slightly. This user guideline follows the procedure and tries to give tips and tricks wherever I myself had difficulties in executing certain steps. It is suggested to use this procedure iteratively: you do not finish it step by step, but always go back to previous steps. For example, try to update the class diagram from step 1 or the mapping from step 2 every time you make a change that is relevant to them.

  • Step 0: Create a Wikidata account or log in in your existing account
  • Step 1: Get a thorough understanding of the data (class diagram: what are the classes? what are the properties? what are the underlying definitions?)
  • Step 2: Analyze to what extent Wikidata already has the required data structures (relevant classes and properties already defined?). If necessary, create the relevant classes and properties (community discussion required). Create a mapping between the source data and the target data structure on Wikidata.
  • Step 3: Check whether data formats in the source data correspond to acceptable formats on Wikidata. If not, some data cleansing may be required.
  • Step 4: Check whether there is an explicit or implicit unique identifier for the data entries. – If not, think about creating one.
  • Step 5: Check whether a part of the data already exists on Wikidata. If yes, create a mapping in order to add data to already existing items and to avoid creating double entries in Wikidata.
  • Step 6: Model the data source in Wikidata by creating an item for your data source. If there are multiple releases, create an item for the actual release you acquired the data from.
  • Step 7: Clean up existing data on Wikidata. (Can be done during previous steps)
  • Step 8: Ingest the data, providing the source for each statement added (avoid overwriting existing data). Log conflicting values. At the beginning, proceed slowly, step by step; check which aspects of the upload process can be automatized using existing tools and which aspects need to be dealt with manually. Ask community members for a review of the first items ingested, before batch ingesting.
  • Step 9: Visualize the data with SPARQL queries in order to inspect the data (quality check).
  • Step 10: Write and publish your report within a suitable Wikiproject.

Tools

Name | Website/download | Domain used | Licence
LibreOffice | https://www.libreoffice.org/ (10.05.2017) | Spreadsheet | Free, Mozilla Public License v2.0
Microsoft Office | https://www.microsoft.com/de-ch/download/office.aspx (10.05.2017) | Spreadsheet & Mail Merge | Trialware
OpenRefine | www.openrefine.org (10.05.2017) | Data cleanup | Free, BSD License
reconcile-csv | http://okfnlabs.org/reconcile-csv/ (10.05.2017) | Fuzzy matching | Free, BSD Licence
QuickStatements | V1: https://tools.wmflabs.org/wikidata-todo/quick_statements.php (10.05.2017), V2: https://tools.wmflabs.org/quickstatements/ (10.05.2017) | Adding statements to Wikidata | Free, GNU General Public License (GPL)

The tools used are listed in the table above. The installation of each tool is explained on its website. The installation and use of OpenRefine and its add-in reconcile-csv are shown in detail in steps 3 and 5.

Tips and explanations for the steps

Step 0

If you do not have an account yet, you should create one. You can do so here: https://www.wikidata.org/w/index.php?title=Special:CreateAccount&returnto=Wikidata%3AMain+Page. The account is needed if you work with Wikimedia-internal tools such as QuickStatements. Furthermore, it helps other people to get in touch with you, as you will get a notification once someone mentions you in an article.

Step 1

During this step I suggest creating an actual class diagram with all the properties from your dataset. This step can already be combined with step 3: once you see that there are properties that obviously cannot be used for Wikidata, you can directly get rid of them.

The output of this step is a class diagram with every property you want to ingest in Wikidata.

Step 2

In this step you should create a table with all the properties you identified in your class diagram from step 1. The table should afterwards look like this:

Property in datafile | Corresponding Wikidata property | Refers to | Existing Wikidata item / possible value | Remarks
Kind of species | https://www.wikidata.org/wiki/Property:P105 | | https://www.wikidata.org/wiki/Q43576 | Species of mammal: wild cat

This table uses the example of a cat. Instead of just writing "cat" in the "possible value" column, try to put in an actual Wikidata item whenever possible. Write a remark whenever you think it helps to understand your thought process (remember, you aim to publish your report; it should afterwards help others).

In most cases you can include the property https://www.wikidata.org/wiki/Property:P31 (instance of), even though there is quite a high chance that it is not a property from your dataset. Also try to fill in as many links to existing Wikidata items as possible. If you need to put in a string that describes a noun, chances are that there is already an existing Wikidata item for this thing. If this is not the case, think about creating one.

 Now, how do you get to the corresponding properties and items?

  • The easiest thing you can do is just type the word you are looking for in the search bar on the top right of the website. Then make sure the description of the item or property you found matches with what you want to describe in your dataset.
  • A better solution is to search for an item of your database that already exists in Wikidata. For example, if you have a dataset with information about lots of different animals, then search for a cat. Then look up what sort of properties are described for this kind of item; most of them will already match the properties of your dataset (see the query sketch after this list).
  • A third solution is to look up possible properties in the property search section or in the full list of properties.
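A quick way to combine the second and third approaches is to ask the query service which properties are actually used on items of a relevant class. The query below is only a sketch; Q146 (house cat) stands in for whatever class fits your dataset:

    # Sketch: list the properties used on items that are instances of a given class,
    # ordered by how often they occur. Q146 (house cat) is just a placeholder class.
    SELECT ?p (COUNT(*) AS ?uses) WHERE {
      ?item wdt:P31 wd:Q146 ;
            ?p ?value .
      # keep only direct claim predicates (wdt:...)
      FILTER(STRSTARTS(STR(?p), "http://www.wikidata.org/prop/direct/"))
    }
    GROUP BY ?p
    ORDER BY DESC(?uses)

The resulting wdt:P numbers can then be checked one by one and entered into the mapping table.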

Property proposal:

It may be the case that you have a property in your dataset that does not yet exist in Wikidata. In this case you need to propose it in a special section. The property proposal is a special template that needs to be filled in. Afterwards the community will discuss it, and if there are more supporters than opposers it gets approved. Try to give as much information as possible about the property to improve the chance of approval.
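Before proposing a new property, it is also worth searching the existing properties by their labels, since the property may already exist under a different name. A minimal sketch (the keyword "inventory" is only an example):

    # Sketch: find existing properties whose English label contains a keyword.
    SELECT ?property ?propertyLabel WHERE {
      ?property a wikibase:Property ;
                rdfs:label ?propertyLabel .
      FILTER(LANG(?propertyLabel) = "en")
      FILTER(CONTAINS(LCASE(?propertyLabel), "inventory"))
    }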

The output of this step is a mapping table with every property you want to ingest and its corresponding Wikidata property/item.

Step 3

In this step tips about the process of data cleansing are given. Clean data is necessary as the data you make available in Wikidata should have as high a quality as possible.

I found the best tool here to be OpenRefine (formerly Google Refine), a spreadsheet-like tool that focuses on mass editing of data. Some minor data cleansing can also be done in Microsoft Excel.

Getting started in OpenRefine

If you have not yet downloaded OpenRefine, you can do so here: www.openrefine.org

Afterwards, the best thing to do is to spend a few minutes and watch these videos, which explain the basic concepts of OpenRefine. Note that in the videos the application was still called Google Refine, but nowadays most of the functions still work the same.

https://www.youtube.com/watch?v=B70J_H_zAWM

https://www.youtube.com/watch?v=cO8NVCs_Ba0

https://www.youtube.com/watch?v=5tsyz3ibYzk

These three videos explain the basic concepts quite well. The function I found most useful for data cleansing is the "facet" function, which sorts your data automatically into facets based on the similarity of the data. Afterwards it is possible to mass-edit just a certain facet of your data. To see how this is done, watch video 1.

Transfer data from OpenRefine to Excel and back

The most common and generic format for table data is .csv. However, I found that Excel does not handle this format very well. Every time I downloaded data from OpenRefine as .csv and opened it afterwards in Excel, it kept changing formats, and it could not handle umlauts such as ä, ö, ü, which are quite common in German. As a workaround I used the LibreOffice spreadsheet tool, an open-source spreadsheet application that works very similarly to Excel but handles the .csv format much better. Once you open the data in LibreOffice, it can be saved as .xlsx and opened again in Excel.

Useful OpenRefine commands can be applied using GREL expressions. A list of string GREL expressions can be found here:

The output of this step is a clean dataset.

Step 4

Having a unique identifier in your data has many advantages. Firstly, it is useful if there is already existing data from your dataset in Wikidata: you can use the unique identifier for the matching. Secondly, once all of your data is ingested into Wikidata, you could use Wikidata as your primary database. It does not matter if someone adds other statements to the items; you can still retrieve the part of it you need. Afterwards, the unique identifier can again be used to update your old data.

What to do if there is no unique identifier?

If you do not have any unique identifier but a large dataset, it could be useful to just create one yourself. Preferably you are the data owner in this case. Once you have given each item in your dataset a unique identity number, you can use the property https://www.wikidata.org/wiki/Property:P217 as a generic ID property and add your identifier as a string.

The output of this step is your dataset with a unique identifier.

Step 5

This step has two difficulties. The first one is the most crucial point of the whole process: you need to check whether part (or all) of your data already exists in Wikidata. The second is to do the actual mapping of your dataset with the existing one. It is important that you not only merge the corresponding data of the main items your dataset is about, but also all the other properties your dataset contains that could be present in Wikidata (i.e. countries, municipalities, languages etc.).

The best way to check whether part of your data is already in Wikidata is to create an appropriate query and run it against Wikidata's SPARQL endpoint: https://query.wikidata.org/

If your dataset has a unique identifier, this will be much easier. In the best case, the unique identifier has its own property and you can simply query for items that have that specific property. If you do not have a unique identifier, you have to create a query that describes your data as well as possible and hope for easily distinguishable names for the mapping afterwards.
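As an example, here is a sketch of such a query for the best case, where the identifier already has its own Wikidata property. P217 (inventory number), mentioned in step 4, is only a placeholder for your actual identifier property:

    # Sketch: list all items that already carry the identifier property,
    # together with the identifier value, so they can be matched against your dataset.
    # P217 (inventory number) is only a placeholder.
    SELECT ?item ?itemLabel ?identifier WHERE {
      ?item wdt:P217 ?identifier .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }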

How do I learn SPARQL?

If you haven't used SPARQL before, you should first make yourself acquainted with its basic concepts. Wikidata's SPARQL endpoint uses almost the same syntax as any other SPARQL endpoint, apart from the fact that it uses the Wikidata-internal Q- and P-numbers. You can find a lot of information in this section: https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service

How do I create useful SPARQL queries?

A good way to start is to look into previous reports in your WikiProject and check whether someone has already written SPARQL queries. Beyond that, I found it most useful to keep the query as short as possible, as this reduces the danger of leaving out data. You can either use instance of or the item of which your item is a subclass, for example https://www.wikidata.org/wiki/Q622852 instead of https://www.wikidata.org/wiki/Q16521, in order to reduce the chance of omitting items.
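A common pattern that covers a class together with all of its subclasses is to chain instance of (P31) with subclass of (P279). The query below is only a sketch; Q41176 (building) stands in for the class relevant to your dataset:

    # Sketch: retrieve items that are an instance of the class itself
    # or of any of its subclasses, to reduce the chance of omitting items.
    # Q41176 (building) is only a placeholder class.
    SELECT ?item ?itemLabel WHERE {
      ?item wdt:P31/wdt:P279* wd:Q41176 .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }
    LIMIT 1000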

How do I handle languages in SPARQL queries?

The easiest way to do so is to add the following snippet to your WHERE clause:

    SERVICE wikibase:label {
      bd:serviceParam wikibase:language "de" .
    }

(you can also add further languages, e.g. "de,en,fr")

Then you just have to add the word "Label" to the variable of the item you are looking for. For example, if your query is:

    SELECT ?person
    WHERE {
      ?person wdt:P39 wd:Q11811941 .
      SERVICE wikibase:label {
        bd:serviceParam wikibase:language "de" .
      }
    }

you add ?personLabel to your SELECT clause. Doing so, you automatically get the label for your retrieved item.

Another, more elegant way to handle languages is to add each language as an OPTIONAL pattern to your WHERE clause:

    OPTIONAL { ?item rdfs:label ?Label_de . FILTER (lang(?Label_de) = "de") }
    OPTIONAL { ?item rdfs:label ?Label_fr . FILTER (lang(?Label_fr) = "fr") }
    OPTIONAL { ?item rdfs:label ?Label_it . FILTER (lang(?Label_it) = "it") }
    OPTIONAL { ?item rdfs:label ?Label_en . FILTER (lang(?Label_en) = "en") }

Of course, you also need to add them to your SELECT clause: ?Label_de and so on.

How do I handle double entries?

This is quite a common problem that I unfortunately cannot handle very well. Try to use SELECT DISTINCT, as in the sketch below.

(need reviewing: please enter a better solution here)
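As a minimal sketch of the DISTINCT approach: without DISTINCT, an item is returned once per matching statement (for example once per P131 value), while DISTINCT keeps each row only once. P1435 and P131 are used here simply because they also appear in the step 8 examples:

    # Sketch: DISTINCT collapses duplicate rows caused by items
    # that have several values for the same property (e.g. two P131 statements).
    SELECT DISTINCT ?item WHERE {
      ?item wdt:P1435 ?heritageStatus ;
            wdt:P131 ?adminUnit .
    }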

What if I don't find any results with any SPARQL query at all?

There can always be the case that you cannot find any results, for example if the items already existing in Wikidata are poorly described or do not have any properties at all. In this case you can either do a manual search in the search bar or try to look up items via Wikipedia's category pages. Items that have an existing Wikipedia article were often transferred automatically to Wikidata (although poorly described), and the category pages concentrate these articles on one page. From each article you can then look up its corresponding Wikidata item in the toolbar on the left-hand side.

I am unable to create any SPARQL queries at all. What do I do?

You can still ask the community whether someone can help you with your query. The place to do so is https://www.wikidata.org/wiki/Wikidata:Request_a_query.

Once you have retrieved a list of possible candidates from the SPARQL endpoint, you can start merging them with your existing dataset.

How do I merge data with a unique identifier?

The best way to do so is with the help of OpenRefine. Basically, you need to create two separate OpenRefine projects: one with the original dataset (let's call it MergeTestA) and another with the dataset that you retrieved from Wikidata and that contains the same unique identifier you are using in your original dataset (let's call it MergeTestB). Afterwards you create a new column based on the column with the unique identifier in MergeTestA and use the following GREL expression:

cell.cross("MergeTestB", "Unique ID").cells["Q-Number"].value[0]

"Unique ID" points to the unique identifier column in the second project (MergeTestB) and "Q-Number" is the name of the column that contains the data you want to match (which is the Q-number of the corresponding Wikidata item in most cases). This procedure was introduced in a blog post by Tony Hirst. The link contains useful pictures and also some additional code you can use.

How do I merge data without a unique identifier (fuzzy data)?

Data that has no unique identifier can also be merged in OpenRefine, with the help of the add-in reconcile-csv. A common case where this applies is the matching of municipalities: on one hand you have a list of municipalities in your original dataset, on the other hand you have the list of municipalities from Wikidata, which you need to merge by name. The name of a municipality can be slightly different in the two datasets, or sometimes two municipalities have the same name and you need to decide which one your dataset refers to. The reconcile-csv add-in can help here with its fuzzy matching algorithm. The following steps have to be applied:

  1. Create a new OpenRefine project and ingest the .csv file (if not yet done).
  2. Create a second project with a list of every Swiss municipality and its corresponding municipality code (FSO code).
    1. This list is needed to merge the number from the second list (let's call it List B) into the original list (List A).
    2. The list can be retrieved from Wikidata itself using an appropriate SPARQL query (see the query sketch after this list).
  3. Merge the corresponding Wikidata item from the second project (List B) into the first (List A) using the reconcile-csv extension for OpenRefine. Download the .jar file from the homepage (link above) and follow the instructions from the website to merge the data. I encountered several problems while strictly following the instructions; however, I succeeded with the following steps:
    1. Create a new folder called "reconcile" in the extension folder where OpenRefine is installed. (Screenshot in the original report: the folder in which the new folder has to be created; the .jar file has to be stored here, as well as the .csv file.)
    2. Save both the .jar file of the extension and the .csv file from List B in this folder.
    3. Start the extension using cmd: java -Xmx2g -jar reconcile-csv-0.1.2.jar "List B.csv" "Search Column" "Id Column" (Screenshot in the original report: the cmd command to start reconcile-csv.)
    4. "Search column" is the primary column you want to use for matching, e.g. the column with the facility names that are spelled slightly differently. "Id column" is the column containing unique IDs for the facilities; if you don't have one, generate one.
    5. Start OpenRefine and open project "List A".
    6. Select the column with the names for the matching (i.e. municipalities) and select "Reconcile" and "Start reconciling" in the options.
    7. Add a standard reconciliation service pointing to http://localhost:8000/reconcile and start the reconciliation.
    8. Now you can compare the matches proposed by the algorithm. Items that match 1:1 are matched automatically. For the rest there are suggestions based on the similarity of the matches, where you can choose the best matching one; usually, up to a score of 0.8, it is the first entry. Choose the suggestion that matches the entry until there are none left. (Screenshot in the original report: suggestions given by the reconcile-csv tool.)
    9. Once you have found all the matches you can find, create the unique ID column. Do this by selecting "Edit column -> Add column based on this column" in the options of the column where you did your matching. Give the new column a name (using the same name as in the other spreadsheet helps a lot) and use the expression cell.recon.match.id to get the IDs for the matches.
    10. You now have the new unique IDs in your List A.
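For retrieving List B from Wikidata (point 2 above), here is a sketch of such a query. It assumes that Swiss municipalities are instances of Q70208 (municipality of Switzerland) and that the FSO code is stored in P771 (Swiss municipality code); verify both IDs on Wikidata before running it.

    # Sketch: every Swiss municipality with its Q-number and FSO code.
    # Assumption: Q70208 (municipality of Switzerland) and P771 (Swiss municipality code).
    SELECT ?municipality ?municipalityLabel ?fsoCode WHERE {
      ?municipality wdt:P31 wd:Q70208 ;
                    wdt:P771 ?fsoCode .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "de,fr,it,en" . }
    }

The result can be downloaded as .csv from the query service and loaded into OpenRefine as List B.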

The output of this step is your dataset with the corresponding Wikidata Q-number for every entry where one exists.

Step 6

This step is important because under every statement you intend to ingest, you should add a claim stating where your information comes from. In most cases this is just your original dataset. Instead of referring to it with a simple string (how this is best done will be explained in step 8), you should create an actual item that represents your dataset, or better yet, the current release of your dataset.

The output of this step is a Wikidata item representing your dataset (or its current release).

Step 7

This step is best done during the steps where you already need to look at existing items in Wikidata (mostly step 5). The larger the field of data you are working with, the more time-consuming this will be. The improvement of existing data is then done directly in step 8, as you also add statements to items that already exist in Wikidata. Editing items can be done either directly on the website of the item (click on edit for the property or label you want to change, and do not forget to save) or with QuickStatements (see step 8).

The output of this step is cleaned data in Wikidata.

Step 8

The ingestion of bigger datasets is best done using the QuickStatements tool. After using the first version https://tools.wmflabs.org/wikidata-todo/quick_statements.php for a while, it suddenly stopped working for several days. I then started using the second, new version of the tool: https://tools.wmflabs.org/quickstatements/. However, be aware that the creator, User:Magnus, does not recommend actually using the tool outside the Wikidata sandbox (link to be added), as it is still under development. QuickStatements uses a special syntax, which is described directly on the website (link to be added). The picture below shows the code for adding a statement (instance of something) to the existing item Q4115189 (which is a sandbox item), including a start date for the property as well as a reference (which would be the item we created in step 6).

 
(Figure: QuickStatements: editing of an existing item)

If you want to create a new item instead of adding a statement to an existing one, simply use CREATE instead of the Q-number.

The code is best created using Microsoft Office's Mail Merge function. It is essential that you proceed step by step rather than copy-pasting everything into the tool in one dump; this way, errors can be corrected more easily.

(Figure: Creation of the code for QuickStatements using Mail Merge)

The following points describe some obstacles that can occur and how to overcome them:

  • Different strings had to be used, depending on the type of the item (for example different Wikidata Q-numbers in the instance of property for each type of item).
    • Use the merge "if" statement, which can be found in the options.
    • Even better: model the different strings directly in a new column in Excel and just add an empty string ("") in the code; QuickStatements will simply skip it.
  • An "if" statement had to be altered.
    • First press "Alt + F9" to see the actual field codes in the mail merge document, then alter them manually.
  • The code can only be altered by using special merge field braces.
    • Press "CTRL + F9" with the cursor at the place where the braces should be added.
  • The generated text differs from the original data source (e.g. the Excel file).
    • Before even starting the process, the following options have to be activated:
      • File > Options > Advanced
      • Under General, select the "Confirm file format conversion on open" check box and click OK.
      • While choosing your data file, in the "Confirm Data Source" dialog box, select the "Show all" check box and choose "MS Excel Worksheets via DDE (*.xls)" > OK.

Ingesting and improving the QuickStatements code

The following examples show some code, the errors that occurred, and how they could be fixed. Sometimes I could do the improvement myself and sometimes the community was asked.

First attempt:

    CREATE
    LAST   Lfr     „Chapelle du Sacré-Coeur“
    LAST   P381    „9734“
    LAST   P1435   Q8274529
    LAST   P131    Q28035756
    LAST   P131    Q12640
    LAST   P31     Q4989906
    LAST   P625    “46°50'50''N, 6°50'53.1''E“
    LAST   P969    "Chemin du Sacré-Coeur 2“

Problems and solutions:

  • The coordinates did not appear: coordinates have to be ingested in decimal format and with a special syntax, @LAT/LON.
  • The P381 (PCP) number did not appear: the quotation marks have to be plain ones (") and not typographic ones („ “). Press CTRL + Z right after writing them to undo the automatic conversion.
  • No reference: put S248 and the source as a Wikidata object after every statement.

Updated code:

    Q1590954   Lde    Schlossgarten
    LAST       P31    Q4989906                    S248   Q28962694
    LAST       P131   Q14274                      S248   Q28962694
    LAST       P1435  Q8274529                    S248   Q28962694
    LAST       P131   Q11972                      S248   Q28962694
    LAST       P625   @47.39405188/8.045910345    S248   Q28962694
    LAST       P969   "Laurenzvorstadt 3"         S248   Q28962694
    LAST       P381   "9388"                      S248   Q28962694

Remaining problems and solutions:

  • The coordinates have too many digits: this error cannot be corrected; the digits appear in any case.
  • Coordinates and addresses appear twice when they were already stated on the original Wikidata item: when there is certainty that the data from the original source is correct (e.g. better than the data already in Wikidata), the old data can be deleted beforehand using the new QuickStatements tool.
  • Wrong class for the objects (monument instead of building): correct the statements and change the class to architectural structure.
  • The statement instance of memory institution may be wrong on some objects: as these have already been completely ingested into Wikidata, the statement does not have to be added at all.

Step 9

Again run some SPARQL queries, preferably the same ones you ran in step 5. Enjoy the better and more complete result, but do not forget to inspect it properly for errors. (There should not be any, because you already made sure of that in step 8 while proceeding step by step.)
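In addition to re-running the step 5 queries, a simple quality check is to count how many statements now cite your source item from step 6 as a reference. The query below is only a sketch; Q28962694 is the example source item from step 8 and has to be replaced with your own:

    # Sketch: count the statements that reference your data source via "stated in" (P248).
    # Q28962694 is only the example source item from step 8 - use your own item here.
    SELECT (COUNT(?statement) AS ?referencedStatements) WHERE {
      ?statement prov:wasDerivedFrom ?reference .
      ?reference pr:P248 wd:Q28962694 .
    }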

Step 10

During the whole process you should take notes of what you have done. At the end you can write a report and publish it within a suitable WikiProject. Your report will help other members of the community to improve their work. Also include errors you made during the process, so that people can learn from them.