Wikidata:Tools/OpenRefine/Editing/Tutorials/Third-party reconciliation

Sometimes the source you want to import data from is huge. For instance, data sources such as company registers hold many more records than Wikidata will ever have in the corresponding domain. In that case, the usual workflow of loading the source database into OpenRefine and reconciling it to Wikidata is impractical: the database is too big, reconciliation takes far too long, and it very rarely surfaces good matches (because the vast majority of records in the source database do not and should not exist in Wikidata).

This tutorial explains how to turn the problem around: we will instead extract existing Wikidata items with a SPARQL query that targets the corresponding domain, and reconcile these items against our data source. Our goal will be to add authority control identifiers such as VIAF ID (P214) and GND ID (P227) to items about people. We will use the LOBID reconciliation service, which lets us match records against the Integrated Authority File (Q36578) (GND).

Extracting target items with a SPARQL query

Say we are interested in improving the linkage of German researchers. We can retrieve a list of German researchers missing a GND ID (P227) like this:

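# German researchers that have no GND ID yet
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q5;              # instance of: human
        wdt:P106 wd:Q1650915;       # occupation: researcher
        wdt:P27 wd:Q183.            # country of citizenship: Germany
  FILTER NOT EXISTS { ?item wdt:P227 ?gnd }   # keep only items without a GND ID (P227)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} LIMIT 100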

Of course, this query and its limit are arbitrary: we could equally look for Brazilian organizations or Latvian places. The goal is simply to narrow the domain down to items that are likely to have an entry in the target database.
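For repeated runs, the extraction can be scripted instead of going through the query service's web page. The following is a minimal Python sketch, not part of the original tutorial. It assumes the public endpoint https://query.wikidata.org/sparql and its JSON result format. The function names and the User-Agent string are illustrative, and the label language is fixed to English on the assumption that [AUTO_LANGUAGE] is only filled in by the web interface.

```python
import csv
import io
import json
import urllib.parse
import urllib.request

# Assumption: the public Wikidata Query Service endpoint.
ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q5;
        wdt:P106 wd:Q1650915;
        wdt:P27 wd:Q183.
  FILTER NOT EXISTS { ?item wdt:P227 ?gnd }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
} LIMIT 100
"""


def fetch_bindings(query, endpoint=ENDPOINT):
    """Send the query over HTTP GET and return the raw result bindings."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        url,
        headers={
            # Illustrative User-Agent; replace it with one describing your tool.
            "User-Agent": "third-party-reconciliation-sketch/0.1",
            "Accept": "application/sparql-results+json",
        },
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)["results"]["bindings"]


def bindings_to_rows(bindings):
    """Turn result bindings into (Qid, label) pairs."""
    rows = []
    for binding in bindings:
        # Entity URIs end in the Qid, e.g. .../entity/Q42
        qid = binding["item"]["value"].rsplit("/", 1)[-1]
        label = binding.get("itemLabel", {}).get("value", "")
        rows.append((qid, label))
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text that OpenRefine can import."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["item", "itemLabel"])
    writer.writerows(rows)
    return buffer.getvalue()
```

Calling rows_to_csv(bindings_to_rows(fetch_bindings(QUERY))) and saving the result to a file gives a two-column table with the same shape as the query results used in the next section.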

Reconciling with GND

Import the results of this query into OpenRefine. The first column contains Qids, which can be reconciled directly to Wikidata (Reconcile → Start reconciling, then choose the Wikidata service). We will also reconcile the second column, but this time against GND itself. To do that, click Reconcile → Start reconciling and then Add standard service. Use the address of the GND reconciliation service run by LOBID: https://lobid.org/gnd/reconcile

 

Just like for Wikidata, you can restrict the reconciliation by type and refine it via properties (see the documentation of the service for more details). You can then match items against GND.
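To make the type restriction concrete, here is a rough Python sketch of the kind of request OpenRefine sends to such a service. It assumes that the LOBID endpoint follows the standard reconciliation API, where a batch of queries is posted as a JSON object in a form field named queries. The helper names, the User-Agent string and the type identifier DifferentiatedPerson are assumptions made for illustration; check the service's documentation for the types it actually offers.

```python
import json
import urllib.parse
import urllib.request

SERVICE = "https://lobid.org/gnd/reconcile"


def build_queries(labels, type_id=None, limit=3):
    """Build one reconciliation query per label, keyed q0, q1, ..."""
    queries = {}
    for index, label in enumerate(labels):
        query = {"query": label, "limit": limit}
        if type_id is not None:
            # Restrict candidates to a single type of the target database.
            query["type"] = type_id
        queries[f"q{index}"] = query
    return queries


def reconcile(labels, type_id=None, service=SERVICE):
    """POST the query batch and return the decoded JSON response.

    Assumption: the service accepts a form-encoded 'queries' field,
    as described by the reconciliation API.
    """
    payload = urllib.parse.urlencode(
        {"queries": json.dumps(build_queries(labels, type_id))}
    ).encode("utf-8")
    request = urllib.request.Request(
        service,
        data=payload,
        # Illustrative User-Agent string.
        headers={"User-Agent": "third-party-reconciliation-sketch/0.1"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)
```

Keeping build_queries separate from the network call makes it possible to inspect the batch before anything is sent.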

 

Retrieving the identifiers

Once you have matched items, you can obtain the GND ID by adding a column with the expression cell.recon.match.id, and the reference name in GND with cell.recon.match.name. You can also obtain this information (and much more) by using the Add columns from reconciled values operation.
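Outside OpenRefine, the same two values can be read from a reconciliation response. The Python sketch below mirrors what cell.recon.match.id and cell.recon.match.name return. It assumes the response shape of the standard reconciliation API (a result list per query key, whose candidates carry id, name, score and match fields) and only considers candidates the service itself flags as matches, whereas in OpenRefine a match can also be chosen by hand. The function names are illustrative.

```python
def best_match(candidates):
    """Return (id, name) of the first candidate flagged as a match, else None."""
    for candidate in candidates:
        if candidate.get("match"):
            return candidate["id"], candidate["name"]
    return None


def matched_columns(response, row_count):
    """Build two columns (ids, names), with None for unmatched rows.

    'response' maps query keys q0, q1, ... to {"result": [candidates]}.
    """
    ids, names = [], []
    for index in range(row_count):
        candidates = response.get(f"q{index}", {}).get("result", [])
        hit = best_match(candidates)
        ids.append(hit[0] if hit else None)
        names.append(hit[1] if hit else None)
    return ids, names
```

Rows without a flagged match keep None in both columns, which corresponds to a blank cell in OpenRefine.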

 

Adding the IDs to Wikidata

We can then create a schema to add the identifiers to Wikidata. You can also add the reference name from GND as an alias to the items.
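To illustrate what such a schema produces, the Python sketch below turns matched rows into one GND ID (P227) statement per item, plus an alias when the GND name differs from the existing label. It is not how OpenRefine represents its schema. As an assumption made only for readability, the edits are written in the tab-separated QuickStatements V1 command syntax, and the alias language code de and the function name are illustrative choices.

```python
def to_candidate_edits(rows, alias_lang="de"):
    """Turn (qid, gnd_id, gnd_name, label) rows into edit commands.

    Output lines use the tab-separated QuickStatements V1 syntax
    (an assumption for illustration). Names containing double quotes
    are not handled here.
    """
    lines = []
    for qid, gnd_id, gnd_name, label in rows:
        if not gnd_id:
            # Unmatched row: nothing to add.
            continue
        # Identifier statement: item, property, quoted string value.
        lines.append(f'{qid}\tP227\t"{gnd_id}"')
        if gnd_name and gnd_name != label:
            # Add the GND reference name as an alias in the chosen language.
            lines.append(f'{qid}\tA{alias_lang}\t"{gnd_name}"')
    return lines
```

Skipping the alias when it equals the label avoids redundant edits. Whether the GND name form is a useful alias is a judgment call for each dataset.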

 

This gives a list of candidate edits to review.

 

These edits can then be uploaded to Wikidata.

Other reconcilable data sources