Wikidata:Tools/OpenRefine/Editing/Tutorials/Third-party reconciliation
Sometimes, the source you want to import data from is huge. For instance, data sources such as company registers hold many more records than Wikidata will ever have in the corresponding domain. In that case, the usual workflow of loading the source database into OpenRefine and reconciling it to Wikidata is impractical: the databases are too big, and reconciliation will take ages and will very rarely surface good matches (because the vast majority of records from the source database do not and should not exist in Wikidata).
This tutorial explains how to turn the problem around: we will instead extract existing Wikidata items with a SPARQL query that targets the corresponding domain, and reconcile these items against our data source. Our goal will be to add authority control identifiers such as VIAF cluster ID (P214) and GND ID (P227) to items about people. We will use the LOBID reconciliation service, which lets us match records against the Integrated Authority File (Q36578) (GND).
Extracting target items with a SPARQL query
Say we are interested in improving the linkage of German researchers. We can retrieve a list of German researchers missing a GND ID (P227) like this:
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q5;
        wdt:P106 wd:Q1650915;
        wdt:P27 wd:Q183.
  FILTER NOT EXISTS { ?item wdt:P227 ?gnd }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} LIMIT 100
Of course, this query and its limit are arbitrary - we could equally look for Brazilian organizations or Latvian places. The goal is simply to narrow down the domain to items which are likely to have an entry in the target database.
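If you prefer to script this step rather than download the results from the query service interface, the following is a minimal sketch in Python. It assumes the public Wikidata Query Service endpoint at https://query.wikidata.org/sparql; the function names, the User-Agent string and the CSV layout are illustrative choices, not part of any official tool. The label language is fixed to "en" here on the assumption that [AUTO_LANGUAGE] is filled in by the query service interface rather than by the endpoint.

```python
import csv
import json
import urllib.parse
import urllib.request

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(occupation="Q1650915", country="Q183", missing_prop="P227", limit=100):
    """Build the tutorial's query; the parameters are illustrative knobs."""
    return f"""
SELECT ?item ?itemLabel WHERE {{
  ?item wdt:P31 wd:Q5;
        wdt:P106 wd:{occupation};
        wdt:P27 wd:{country}.
  FILTER NOT EXISTS {{ ?item wdt:{missing_prop} ?value }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}} LIMIT {limit}
""".strip()


def bindings_to_rows(result):
    """Turn a SPARQL JSON result into (Qid, label) tuples."""
    rows = []
    for binding in result["results"]["bindings"]:
        # The item comes back as a full entity URI; keep only the Qid.
        qid = binding["item"]["value"].rsplit("/", 1)[-1]
        label = binding.get("itemLabel", {}).get("value", "")
        rows.append((qid, label))
    return rows


def fetch_rows(query):
    """Run the query against the endpoint (needs network access)."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    # The User-Agent value is a placeholder; replace it with your own contact details.
    request = urllib.request.Request(url, headers={"User-Agent": "third-party-recon-sketch/0.1"})
    with urllib.request.urlopen(request) as response:
        return bindings_to_rows(json.load(response))


def write_csv(rows, path):
    """Write the rows to a CSV file that OpenRefine can import."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["item", "itemLabel"])
        writer.writerows(rows)
```

Calling write_csv(fetch_rows(build_query()), "researchers.csv") would then produce a two-column file with Qids and labels, ready to be imported into OpenRefine.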
Reconciling with GND
Import the results of this query into OpenRefine. The first column contains Qids, which can be reconciled directly to Wikidata (Reconcile → Start reconciling and choose the Wikidata service). We will also reconcile the second column, but this time against GND itself. To do that, click Reconcile → Start reconciling and Add standard service. Use the address of the GND reconciliation service run by LOBID: https://lobid.org/gnd/reconcile
Just like for Wikidata, you can restrict the reconciliation by type and refine it via properties (see the documentation of the service for more details). You can then match items against GND.
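To make it clearer what OpenRefine sends to the service during this step, here is a rough sketch of a reconciliation request in Python. It assumes the service follows the standard reconciliation API, in which a batch of queries is posted as a JSON object in a form parameter named queries. The helper names are hypothetical, and any type identifier you pass in (for instance a type for persons) is an assumption that should be checked against the documentation of the service.

```python
import json
import urllib.parse
import urllib.request

# Address of the GND reconciliation service run by LOBID (from the tutorial).
GND_RECONCILE_URL = "https://lobid.org/gnd/reconcile"


def build_queries(names, type_id=None, limit=3):
    """Build a batch of reconciliation queries keyed q0, q1, ..."""
    queries = {}
    for index, name in enumerate(names):
        query = {"query": name, "limit": limit}
        if type_id is not None:
            # Optional type restriction; valid identifiers depend on the service.
            query["type"] = type_id
        queries[f"q{index}"] = query
    return queries


def encode_request(queries):
    """Encode the batch as a form body with a single 'queries' parameter."""
    return urllib.parse.urlencode({"queries": json.dumps(queries)}).encode("utf-8")


def reconcile(names, type_id=None, limit=3, url=GND_RECONCILE_URL):
    """Send the batch to the service and return the decoded JSON response (needs network access)."""
    body = encode_request(build_queries(names, type_id, limit))
    # The User-Agent value is a placeholder.
    request = urllib.request.Request(url, data=body, headers={"User-Agent": "third-party-recon-sketch/0.1"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

In practice OpenRefine handles batching and candidate review for you; a script like this is mostly useful for checking how a service responds before running a large reconciliation.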
Retrieving the identifiers
Once you have matched items, you can obtain the GND ID by adding a column with the expression cell.recon.match.id, and you can obtain the reference name in GND with cell.recon.match.name. You can also obtain this information (and much more) by using the Add columns from reconciled values operation.
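For readers working outside OpenRefine, the two expressions above correspond roughly to picking the matched candidate out of a reconciliation response and reading its id and name fields. The sketch below assumes the standard reconciliation API response shape (one result list per query key, each candidate carrying id, name, score and match); the identifiers and names in the sample are made up for illustration and are not real GND records.

```python
def matched_candidate(candidates):
    """Return the first candidate flagged as a match, or None if there is none."""
    for candidate in candidates:
        if candidate.get("match"):
            return candidate
    return None


def extract_matches(response):
    """Map each query key to an (id, name) pair, or None when nothing matched."""
    matches = {}
    for key, payload in response.items():
        candidate = matched_candidate(payload.get("result", []))
        matches[key] = (candidate["id"], candidate["name"]) if candidate else None
    return matches


# Hypothetical response: the ids and names below are placeholders, not real GND data.
SAMPLE_RESPONSE = {
    "q0": {"result": [
        {"id": "EXAMPLE-ID-1", "name": "Example, Person", "score": 98.0, "match": True},
        {"id": "EXAMPLE-ID-2", "name": "Example, Other", "score": 40.0, "match": False},
    ]},
    "q1": {"result": [
        {"id": "EXAMPLE-ID-3", "name": "Sample, Person", "score": 55.0, "match": False},
    ]},
}
```

Note that in OpenRefine a cell can also be matched by hand, so cell.recon.match reflects your judgment as well as the service's; this sketch only looks at the match flag set by the service.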
Adding the ids to Wikidata
We can then create a schema to add the identifiers to Wikidata. You can also add the reference name from GND as an alias on the items.
This gives a set of candidate edits, which can then be uploaded to Wikidata.
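As a way to picture what such candidate edits look like, the sketch below renders them as QuickStatements version 1 commands, a tab-separated text format that can also be used to upload edits. This is a swapped-in illustration rather than the schema mechanism itself: the function name is hypothetical, the alias language "de" is an assumption, and the sample row uses the Wikidata Sandbox item (Q4115189) with a placeholder identifier.

```python
def to_quickstatements(rows, id_property="P227", alias_lang="de"):
    """Render (Qid, external id, reference name) rows as QuickStatements v1 lines.

    Values are assumed to contain no tabs, newlines or double quotes.
    """
    lines = []
    for qid, external_id, name in rows:
        # Statement adding the external identifier as a string value.
        lines.append(f'{qid}\t{id_property}\t"{external_id}"')
        if name:
            # Alias in the chosen language, written as A<language code>.
            lines.append(f'{qid}\tA{alias_lang}\t"{name}"')
    return "\n".join(lines)


# Placeholder data: the identifier and name are not real GND values.
SAMPLE_ROWS = [("Q4115189", "EXAMPLE-GND-ID", "Example, Person")]
```

Reviewing the generated lines before uploading plays the same role as checking the candidate edits in OpenRefine's preview.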
Other reconcilable data sources
Various other data sources can be queried via reconciliation services. Here are a few:
- Virtual International Authority File (Q54919): http://refine.codefork.com/reconcile/viaf (docs; consider running the interface locally for heavy use)
You can find other reconciliation services in the reconciliation test bench.
It is possible to create your own reconciliation interface for other databases, for instance via reconcile-csv, conciliator or by implementing the reconciliation API yourself.
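To give an idea of what implementing the reconciliation API yourself involves, here is a deliberately small sketch of the core logic in Python. It is not how reconcile-csv or conciliator work internally: the in-memory records, the example.org URIs, the difflib-based scoring and the match threshold are all assumptions made for illustration. A real service would also expose this logic over HTTP, returning the manifest when no queries are supplied.

```python
import json
from difflib import SequenceMatcher

# Hypothetical database: record id -> name.
RECORDS = {"rec-1": "Example Person One", "rec-2": "Sample Organisation"}


def manifest():
    """Service manifest with the three fields the reconciliation API requires."""
    return {
        "name": "Example reconciliation service",
        "identifierSpace": "http://example.org/id/",
        "schemaSpace": "http://example.org/schema/",
    }


def score(query, name):
    """Case-insensitive string similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, query.lower(), name.lower()).ratio(), 1)


def reconcile_one(query, records, limit=3, match_threshold=95.0):
    """Return the best candidates for one query string, highest score first."""
    candidates = [
        {"id": record_id, "name": name, "type": [], "score": score(query, name), "match": False}
        for record_id, name in records.items()
    ]
    candidates.sort(key=lambda candidate: candidate["score"], reverse=True)
    top = candidates[:limit]
    # Only the best candidate may be flagged as a match, and only above the threshold.
    if top and top[0]["score"] >= match_threshold:
        top[0]["match"] = True
    return top


def handle_queries(queries_json, records=RECORDS):
    """Answer a batch of queries given as the JSON string of the 'queries' parameter."""
    queries = json.loads(queries_json)
    return {
        key: {"result": reconcile_one(query["query"], records, query.get("limit", 3))}
        for key, query in queries.items()
    }
```

For example, handle_queries(json.dumps({"q0": {"query": "example person one"}})) returns both records as candidates for q0, with the exact match ranked first and flagged as a match.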