Wikidata:Tools/OpenRefine/Editing/Quality assurance

Example issues reported by OpenRefine in a sample project.

This page explains how the Wikidata extension of OpenRefine analyzes edits before they are uploaded to Wikidata.

Overview

Edits are scrutinized before they are uploaded, and also before the current content of the corresponding items is retrieved and merged with the updates. As a result, the software cannot predict some constraint violations (for instance, a new statement that conflicts with an existing statement on the item). In exchange, the checks run quickly, even for relatively large batches of edits, so issues are refreshed in real time while the user builds the schema.

As a consequence, not all constraint violations can be detected: the ones that are supported are listed in the Constraint violations section. Conversely, not all issues reported will be flagged as constraint violations on Wikidata: see Generic issues for these.

Reconciliation

You should always assess the quality of your reconciliation results first. OpenRefine has various tools for quality assurance of reconciliation results. For instance:

  • you can analyze the string similarity between your original names and those of the reconciled items (for instance with Reconcile → Facets → Best candidate's name edit distance);
  • you can compare the values in your table with those on the items (via a text facet defined by a custom expression);
  • you can facet by type on the reconciled items (add a new column with the types and use a text facet ordered by counts to get a sense of the distribution of types in your reconciled items).
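The first check in the list above can also be approximated outside OpenRefine. The following is a minimal sketch, assuming you have exported pairs of (original name, best candidate name) from your project. The function names and the threshold are illustrative and are not part of OpenRefine.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimal number of single-character
    insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def suspicious_matches(pairs, max_distance=3):
    """Return the (original, candidate) pairs whose names differ by more
    than max_distance edits, ignoring case. The threshold is an assumption."""
    return [
        (original, candidate)
        for original, candidate in pairs
        if edit_distance(original.lower(), candidate.lower()) > max_distance
    ]
```

Pairs returned by `suspicious_matches` are worth reviewing by hand; a large distance does not prove that the match is wrong, only that the names differ noticeably.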

Constraint violations

Constraints are retrieved as defined on the properties, using property constraint (P2302).
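To see which constraints a property declares, you can look at its property constraint (P2302) statements yourself. The sketch below builds a request URL for the `wbgetclaims` module of the Wikidata API and extracts the constraint type items from a response. It is an illustration of where the data lives, not a description of how OpenRefine fetches it; the helper names are hypothetical.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://www.wikidata.org/w/api.php"


def constraint_claims_url(property_id: str) -> str:
    """Build a wbgetclaims URL listing the P2302 statements of a property."""
    params = {
        "action": "wbgetclaims",
        "entity": property_id,
        "property": "P2302",
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


def constraint_types(response: dict) -> list:
    """Extract the constraint type item ids from a decoded wbgetclaims response."""
    types = []
    for claim in response.get("claims", {}).get("P2302", []):
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") == "value":  # skip "novalue" / "somevalue" snaks
            types.append(snak["datavalue"]["value"]["id"])
    return types
```

Fetching the URL (for instance with `urllib.request.urlopen`) and decoding the JSON body gives the dictionary expected by `constraint_types`.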

The following constraints are supported:

  • inverse constraint (Q21510855): OpenRefine assumes that the inverses of the candidate statements are not in Wikidata yet. If you know that the inverse statements are already in Wikidata, you can safely ignore this issue.
  • single-value constraint (Q19474404): this will only trigger if you are adding more than one statement with the property on the same item, but will not detect any existing statement with this property.
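The batch-only nature of the single-value check can be illustrated as follows. This is a simplified sketch, not OpenRefine's implementation: it assumes candidate statements are plain (subject, property, value) tuples and that the caller already knows which properties carry the single-value constraint. The property ids in the usage example are placeholders.

```python
from collections import Counter


def single_value_issues(statements, single_value_properties):
    """Return the (subject, property) pairs that receive more than one
    candidate statement for a single-value property.

    Only the batch itself is inspected: statements already present on
    Wikidata are not known here, so conflicts with them go undetected.
    """
    counts = Counter(
        (subject, prop)
        for subject, prop, _value in statements
        if prop in single_value_properties
    )
    return sorted(key for key, count in counts.items() if count > 1)
```

For example, with `"P-single"` standing in for a single-value property:

```python
batch = [
    ("Q1", "P-single", "1952"),
    ("Q1", "P-single", "1953"),  # second value on the same item: reported
    ("Q2", "P-single", "1960"),
    ("Q1", "P-multi", "a"),
    ("Q1", "P-multi", "b"),      # not a single-value property: ignored
]
single_value_issues(batch, {"P-single"})
```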

A comparison of the supported constraints with respect to other implementations is available at Wikidata:WikiProject property constraints/reports/implementations.

Generic issues

OpenRefine also detects issues that are not flagged (yet) by constraint violations on Wikidata:

  • Statements without references. This does not rely on citation needed constraint (Q54554025): all statements are expected to have references. (The idea is that when importing a dataset, every statement you add should link to this dataset; it does not hurt to do so even for generic properties such as instance of (P31).)
  • Spurious whitespace and non-printable characters in strings (including labels, descriptions and aliases);
  • Self-referential statements (statements which mention the item they belong to);
  • New items created without any label;
  • New items created without any description.
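The checks listed above are simple enough to sketch in a few lines. The code below is a rough approximation under stated assumptions, not OpenRefine's logic: a candidate item is modelled as a plain dictionary with hypothetical keys (`id`, `is_new`, `labels`, `descriptions`, `statements`), and "non-printable" is approximated by the Unicode categories Cc and Cf, which can also flag format characters that are legitimate in some scripts.

```python
import unicodedata


def string_issues(text: str) -> list:
    """Report whitespace and non-printable character problems in a string."""
    issues = []
    if text != text.strip():
        issues.append("leading-or-trailing-whitespace")
    if "  " in text:
        issues.append("duplicate-whitespace")
    # Cc = control characters, Cf = format characters (e.g. zero-width space).
    if any(unicodedata.category(ch) in ("Cc", "Cf") for ch in text):
        issues.append("non-printable-character")
    return issues


def item_issues(item: dict) -> list:
    """Report generic issues on one candidate item (hypothetical structure)."""
    issues = []
    for field in ("labels", "descriptions"):
        for text in item.get(field, {}).values():
            issues.extend(string_issues(text))
    if item.get("is_new"):
        if not item.get("labels"):
            issues.append("new-item-without-label")
        if not item.get("descriptions"):
            issues.append("new-item-without-description")
    for statement in item.get("statements", []):
        if not statement.get("references"):
            issues.append("unreferenced-statement")
        # New items have no id yet, so they cannot reference themselves here.
        if item.get("id") is not None and statement.get("value") == item["id"]:
            issues.append("self-referential-statement")
    return issues
```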

Adding support for a new constraint

If you know Java, contributing a new constraint is easy! Look at example scrutinizers (the internal name for constraint checkers) such as SingleValueScrutinizer or FormatScrutinizer and write a similar class that detects the issue you want to highlight. Write the corresponding test class (such as SingleValueScrutinizerTest or FormatScrutinizerTest) to demonstrate the issues raised by your scrutinizer. Finally, register your scrutinizer in EditInspector so that it runs with the rest of the scrutinizers on all candidate edits. Submit your code as a pull request to https://github.com/OpenRefine/OpenRefine.
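The workflow described above (write a checker class, test it, register it so that it runs on every candidate edit) follows a common pattern. The sketch below shows that pattern in Python for readability only. The real scrutinizers are Java classes, and every name and signature here is hypothetical; consult the existing scrutinizers in the repository for the actual interfaces.

```python
class Scrutinizer:
    """Hypothetical base class: inspects candidate edits and returns issues."""

    def scrutinize(self, edits):
        raise NotImplementedError


class SelfReferenceScrutinizer(Scrutinizer):
    """Example checker: flags statements whose value is their own subject."""

    def scrutinize(self, edits):
        return [
            "self-reference on " + edit["subject"]
            for edit in edits
            if edit["subject"] == edit["value"]
        ]


class Inspector:
    """Hypothetical registry that runs every registered scrutinizer."""

    def __init__(self):
        self._scrutinizers = []

    def register(self, scrutinizer):
        self._scrutinizers.append(scrutinizer)

    def inspect(self, edits):
        issues = []
        for scrutinizer in self._scrutinizers:
            issues.extend(scrutinizer.scrutinize(edits))
        return issues
```

Keeping each check in its own class means a new constraint can be added and tested in isolation, with registration as the only change to shared code.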

If you need any help with that, do ping User:Pintoch who will be very happy to help.