Wikidata:Tools/OpenRefine/Editing/Tutorials/Inverse Listeria

Listeria is a powerful tool to generate wiki tables from SPARQL queries and keep them synchronized with Wikidata. Many Wikipedias use it to display elaborate tables which need only to be maintained centrally instead of duplicating the work in each Wikipedia. However, very often the data needed to generate these tables is simply not in Wikidata yet: we need to import it in the first place. This tutorial explains how to import a manually maintained table from Wikipedia into Wikidata.

Caveats edit

It must be noted that not all Wikipedia tables are suitable for import. Not all of them are reliable and current. It is important to check that the information in the table is reliably sourced and not distorted during the import.

Importing the table edit

In this tutorial, we are going to import data about metro systems. The English Wikipedia has an impressive table listing such systems and this is the example that we will use.

One way to import a table found on the web is simply to select it with your mouse, copy it, and paste it in OpenRefine's clipboard import area. This works surprisingly well in many cases, and we could do this here. However, in the case of Wikipedia, this has the disadvantage of discarding valuable information:

the wikilinks included in the table give us reconciled values for free, as the wiki pages they link to are associated with Wikidata items. For instance, the name "SkyTrain" would normally be ambiguous (as it can refer to many different Wikidata items) but in this case it is linked to SkyTrain (Vancouver), a Wikipedia page associated to SkyTrain (Q876780). So we should reuse this information;

the references supplied in each cell value indicate where the information came from: we do not want to discard these sources but import them in Wikidata instead.

Copying the wikicode of a table from a Wikipedia article.

We will therefore use a dedicated importer to create an OpenRefine project from our table. Click Edit in the section of the Wikipedia article where the table is. Make sure you are using the manual text editor and not the visual one because we need to obtain the wikicode of this table. Copy the entire table (which starts with {| and ends with |} and paste it into the clipboard import area of OpenRefine (you can actually copy the entire List subsection as it contains only one table: OpenRefine will ignore anything outside the table).

Previewing the import takes a while: this is because OpenRefine queries Wikidata to resolve the links between the wiki links and the Wikidata items. By default, OpenRefine ignores templates in Wikicode, but in our case the countries and some distances are indicated using templates, so we can tick Include templates and images as raw wikicode. It is also necessary to indicate which Wikipedia the wikitext was taken from (the field points to the English Wikipedia by default, which is fine in our case). Finally, validate the import: creating the project should also take time (but slightly less due to caching).

Wikidata editing with OpenRefine - Inverse Listeria tutorial. Import options for the wikitable.

Checking the reconciliation quality with facets.

Once the project is created, you will notice a few differences with other importers:

many cells are already reconciled: those are the cells which contained a unique wikilink, linking to a page associated to a Wikidata item;

new columns have been created, with the URLs extracted from the references found in the cells. These reference columns are placed to the right of the content column they were extracted from.

Structure of original values and their references.

Even if the cells have been reconciled automatically, make sure you check the quality of these matches. For instance, it is always possible that a wikilink points to a disambiguation page, in which case the cell will be incorrectly matched to a disambiguation item. One simple way to get an idea of the quality of these matches is simply to add columns from these reconciled values (Edit column → Add columns from reconciled column), retrieving properties like instance of (P31) and the ones that relate to the content of your table (such as date of official opening (P1619)). Use facets to explore these values and get a sense of what the items look like.

Cleaning up the data edit

Values not matching the regular expression, isolated with the filter.

Say we want to import the ridership column of this table. This column contains values such as 302.8 (2016): we first need to clean this up, separating the actual ridership from the year. We could for instance do this with a regular expression. It looks like the prevalent format in this column is as follows: a quantity (composed of numbers and a dot), a space, an opening bracket, the year and a closing bracket. This should be matched by the regular expression [\d.]+ \(\d{4}\). You do not need to get this right from the start: just create a Text filter in this column, tick Regular expression, and construct your expression there. In real time, OpenRefine compiles your expression and shows you the rows that match it. By ticking invert you can also see the rows that do not match the expression. These are rows where the value was not provided or where a source is missing, so we can just exclude these rows from our import.

Extracting amounts from a compound cell in Python.

Now that we have figured out a regular expression that captures the format of these cells, we can also use it to split the contents into two parts.^[1] We are first going to add groups to the expression, so that we can extract its parts: ([\d.]+) \((\d{4})\). The first group captures the amount, the second the year. Now, we can create new columns for each group. Click on the ridership column, select Edit column → Add column based on this column. Select Python / Jython as expression language and use the following expression:

import re # this imports the regular expression module
match = re.search(r'([\d.]+) \((\d{4})\)', value) # this matches the value against our regular expression
return match.group(1) # this extracts the first group of characters from the match

Translating amounts to millions.

Give a name to the column (such as ridership_amount) and confirm. Repeat the operation, this time with the second group to create the second column ridership_year. Are we done yet? Not quite, because as the column header says, the ridership is expressed in millions, so we need to multiply our values accordingly. One easy way to do this is to transform the cells in this column (Edit cells → Transform) and simply concatenate "e6" to the end of the value (this is the engineering notation to express quantities in millions).

Creating a schema edit

Now that our data is cleaned up, we can create a Wikidata schema as usual. The first step is to figure out which properties to use. Go to the item for a metro system, such as Algiers Metro (Q728045), and start adding a new statement with the property "ridership". Wikidata suggests patronage (P3872), which looks appropriate after reviewing its property page. The usage instructions of this property suggest the use of the point in time (P585) qualifier to indicate the calendar year in which the ridership was measured.

Click Wikidata → Edit Wikidata schema and build a schema using these properties. Thanks to the wikitable importer, we can use the reference URLs extracted in the column on the right-hand side to create Wikidata references. It can also be useful to record that we imported this data via Wikipedia: we can use Wikimedia import URL (P4656) for this, with a permanent link to the revision we used to perform the extraction (such as https://en.wikipedia.org/w/index.php?title=List_of_metro_systems&oldid=830817609 ).

Wikidata editing with OpenRefine - Inverse Listeria tutorial. Simple schema to add ridership information on Wikidata.

Of course, we can do a similar work for other columns, and add them to the schema, so that they can be uploaded to Wikidata in one go. For instance, the opening date could be added with date of official opening (P1619). Once you are happy with your schema, you can perform the edits as usual, either directly from OpenRefine with the Upload to Wikidata operation, or using QuickStatements.

OpenRefine Wikidata Inverse Listeria tutorial, preview of edits adding ridership information

Edit history edit

The operations described in this tutorial can be reapplied on the project using the following JSON edit history. It is of course specific to the structure of this particular table.

Notes edit

↑ More background about using regular expressions in OpenRefine can be found on OpenRefine's wiki.

[1] More background about using regular expressions in OpenRefine can be found on OpenRefine's wiki.

[1]