Wikidata:Tools/OpenRefine/Editing/Advanced schemas

Sometimes your data is not as simple as a flat table, or the kind of statements you want to make varies from row to row. This document explains how to work around these cases.

Hierarchical data

Sometimes your source provides data in a structured format, such as XML, JSON or RDF. OpenRefine can import these files and will convert them to tables. These tables will reflect some of the hierarchy in the file by means of null cells.

{
  "artists": [
    {
      "name": "Gilberto Gil",
      "songs": [
        "Toda menina Bahiana",
        "Expresso 2222",
        "Refazenda"
      ]
    },
    {
      "name": "Hiromi Uehara",
      "songs": [
        "Desire",
        "Deep Into The Night"
      ]
    }
  ]
}
    

OpenRefine will use its records mode, which treats records as atomic objects. You can see that it counts 2 records even though there are 5 rows, and the shading of the rows reflects that. You can learn more about the differences between the records mode and the rows mode in this article.

The Wikidata extension always works in rows mode, so if we want to add statements which reference both the artist and the song, we need to fill the null cells with the corresponding artist. You can do this with the Fill down operation (in the Edit cells menu for this column). This function will copy not just cell values but also reconciliation results.
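To make the effect of Fill down concrete, here is a minimal Python sketch that runs outside OpenRefine. It flattens the JSON example above into rows where the artist name appears only on the first row of each record, then copies it into the null cells below. The function names `to_record_rows` and `fill_down` are illustrative only and are not part of OpenRefine, and the sketch does not model reconciliation results.

```python
import json

SOURCE = """
{
  "artists": [
    {"name": "Gilberto Gil",
     "songs": ["Toda menina Bahiana", "Expresso 2222", "Refazenda"]},
    {"name": "Hiromi Uehara",
     "songs": ["Desire", "Deep Into The Night"]}
  ]
}
"""


def to_record_rows(data):
    """Mimic the records-mode layout: the artist name is set only on
    the first row of each record, and is null on the following rows."""
    rows = []
    for artist in data["artists"]:
        for index, song in enumerate(artist["songs"]):
            name = artist["name"] if index == 0 else None
            rows.append({"name": name, "song": song})
    return rows


def fill_down(rows, column):
    """Copy the last non-null value of `column` into the null cells below it."""
    last_value = None
    filled = []
    for row in rows:
        if row[column] is not None:
            last_value = row[column]
        filled.append({**row, column: last_value})
    return filled


rows = fill_down(to_record_rows(json.loads(SOURCE)), "name")
for row in rows:
    print(row["name"], "|", row["song"])
```

After filling down, every row carries both the artist and the song, which is what a rows-mode schema needs.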

 
OpenRefine editing manual: example of using the fill down function when working with hierarchical data.

Conditional additions

Sometimes you want to add a statement only under certain conditions.

The workflow is as follows:

  • Use facets to select the rows where you do not want to add any information;
  • Blank out the cells in the column that contain the information you want to add. If you do not want to lose this information, you can create a copy of the column beforehand;
  • Remove your facets to see all rows again;
  • Create a schema using the column you partially blanked out as statement value.

For instance, you might want to add a statement only if there is no statement with this property on the target item. Consider the following table:

 
Initial state of the table, as part of a tutorial on conditional Wikidata uploads in OpenRefine

We want to upload these altitudes to Wikidata, but only for the mountain passes where the altitude is not known yet. To do this, we fetch the existing altitudes using the "Add columns from reconciled values" function, which gives us this table:

 
Second state of the table, as part of a tutorial on conditional Wikidata uploads in OpenRefine

We create a facet on the "elevation above sea level" column (fetched from Wikidata) to select only the non-blank values. Then, we clear out the corresponding cells in the "altitude" column using the "Edit cells" -> "Common transforms" -> "To null" function. We can then remove the facet, and we get this table:

 
Final state of the table, as part of a tutorial on conditional Wikidata uploads in OpenRefine

We can now upload our data using the first and third columns in a Wikidata schema.
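The blanking step can be sketched in Python as follows. This is only an illustration of the logic, not how OpenRefine implements it; the pass names and numbers are hypothetical placeholders (the values in the screenshots are not reproduced here), and `blank_where_known` is an illustrative helper name.

```python
# Hypothetical rows: "altitude" comes from our source, while
# "elevation above sea level" was fetched from Wikidata
# (None means the item has no such statement yet).
rows = [
    {"pass": "Pass A", "altitude": 2469, "elevation above sea level": None},
    {"pass": "Pass B", "altitude": 2108, "elevation above sea level": 2106},
    {"pass": "Pass C", "altitude": 1611, "elevation above sea level": None},
]


def blank_where_known(rows, source_column, existing_column):
    """Set `source_column` to None on rows where `existing_column` is non-blank.

    This combines the facet on non-blank values with the "To null" transform.
    """
    cleaned = []
    for row in rows:
        new_row = dict(row)
        if row[existing_column] is not None:
            new_row[source_column] = None
        cleaned.append(new_row)
    return cleaned


cleaned = blank_where_known(rows, "altitude", "elevation above sea level")

# Only rows with a remaining altitude would produce a statement.
to_upload = [(r["pass"], r["altitude"]) for r in cleaned if r["altitude"] is not None]
print(to_upload)
```

Working on a copy of each row mirrors the advice above to duplicate the column first if you do not want to lose the original values.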

Varying properties

Sometimes you wish you could use column variables for properties in your schema. This is currently not possible, for two reasons: there is no reconciliation service for properties yet, and allowing varying properties in a statement would mean that these properties could have different datatypes, which would break the structure of the schema.

If you only want to use a few properties, there is a way to work around this problem. For instance, say you have a column of altitudes and a column that indicates whether each value should be added as maximum operating altitude (P2254) or as elevation above sea level (P2044).

 
Wikidata editing with OpenRefine: example project where varying properties are needed. Initial state of the project.

Create a text facet on the type column. Select the altitude value in the facet so that only those rows are shown. Add a new column based on the metres column, keeping the default expression (value), which just copies the existing values. Then, select the maximum operating altitude value in the facet and do the same. Reset the facet. You should obtain something like this:

 
Wikidata editing with OpenRefine: example project where varying properties are needed. Transformed project.

Now you can use the fact that empty cells will be ignored by the Wikidata extension. Just build the following schema:

 
Wikidata editing with OpenRefine: example project where varying properties are needed. Schema.

Even if it looks like we are adding both properties on the same item, in practice the columns are never non-empty simultaneously, so we are effectively adding the values on different items with the appropriate properties.
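The following Python sketch illustrates why this works. It splits a single value column into one column per property and then generates statements while skipping empty cells. The rows, the item labels and the helper names `split_by_type` and `statements` are hypothetical; only the two property identifiers come from the text above.

```python
# Hypothetical project rows: one value column plus a column saying
# which property the value belongs to.
rows = [
    {"item": "Item A", "metres": 4810, "type": "elevation above sea level"},
    {"item": "Item B", "metres": 12500, "type": "maximum operating altitude"},
]

# One derived column per property used in the schema.
COLUMN_TO_PROPERTY = {
    "elevation above sea level": "P2044",
    "maximum operating altitude": "P2254",
}


def split_by_type(rows):
    """Copy `metres` into the column matching `type`; leave the other column empty."""
    result = []
    for row in rows:
        new_row = {"item": row["item"]}
        for column in COLUMN_TO_PROPERTY:
            new_row[column] = row["metres"] if row["type"] == column else None
        result.append(new_row)
    return result


def statements(rows):
    """Emit (item, property, value) triples, ignoring empty cells."""
    triples = []
    for row in rows:
        for column, prop in COLUMN_TO_PROPERTY.items():
            if row[column] is not None:
                triples.append((row["item"], prop, row[column]))
    return triples


for triple in statements(split_by_type(rows)):
    print(triple)
```

Each row yields at most one statement, because at most one of the derived columns is non-empty on any given row.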

Adapting to existing data on Wikidata

Sometimes you want to create statements only if there are no such statements on the item yet. Here is one way to achieve this:

  • first, retrieve the existing values from Wikidata, using the "Edit columns" -> "Add columns from reconciled values" action;
  • second, create a facet by null on the newly created column that contains the information you want to check against;
  • select the non-null rows (value false);
  • clear the contents of the column where your source values are ("Edit cells" -> "Common transforms" -> "To null").

You can now construct your schema as usual: null values will be ignored when generating the statements. You should obtain a project that looks like this:

 
Wikidata editing with OpenRefine. Advanced schema construction: avoiding existing values.

You can also use this method to add statements only if no referenced value is already on Wikidata. When adding columns from reconciled values, click configure and filter statements according to the referencing level that you require:

 
Wikidata editing with OpenRefine. Advanced schema construction: configuring the retrieval of existing values.

Note that you can also filter by rank, and return the number of matching values instead of the values themselves (which can potentially speed up fetching if you are dealing with a large number of values for each item).