Wikidata:WikidataCon 2017/Notes/Using Shape Expressions for data quality and consistency in Wikidata

Title: Using Shape Expressions for data quality and consistency in Wikidata

Note-taker(s): Gikü, Thiemo

Speaker(s)

Name or username: Andra Waagmeester, Katherine Thornton, Lucas Werkmeister, Eric Prud'hommeaux (not attending), Gregory Stupp

Useful links

slides: https://shexspec.github.io/talks/2017/10-28-wikidatacon/index.html#(1)

http://shex.io

Abstract

As a truly open data infrastructure, Wikidata is subject to community issues such as disagreement, bias, human error, and vandalism. From a curator's perspective, it can be challenging at times to filter through the different Wikidata views while maintaining one's own definitions and standards. Whether the cause is a benign difference in opinion or a more malignant form of vandalism or the introduction of low-quality evidence, public databases face extra challenges in providing data quality in the public domain. Here we propose the use of W3C Shape Expressions (ShEx: https://shexspec.github.io/primer/) as a toolkit to model, validate and filter the interactions between designated public resources and Wikidata. ShEx is a language for expressing constraints on RDF graphs and a schema language for graphs. Wikidata is fundamentally a graph, so ShEx can be used to validate Wikidata items, communicate expected graph patterns, and generate user interfaces and interface code. It will also allow us to efficiently:

  1. Exchange and understand each other’s models
  2. Express a shared model of our footprint in Wikidata
  3. Agilely develop and test that model against sample data, and evolve it
  4. Catch disagreement, inconsistencies or errors efficiently at input time or in batch inspections.

Collaborative notes of the session

One can use Shape Expressions (abbreviated ShEx) to validate data before a mass import.

Shape Expression (ShEx) schemas describe what data should look like. An example is explained in detail.

Schemas can contain other schemas as part of their definition.

Validating data against a schema produces validation results.

Simple example: a schema asks for a "title" of type LITERAL, which marks literals like "Oliver Twist" as valid and other values, such as IRIs (e.g. <http://…>), as invalid.
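
A minimal sketch of what such a schema might look like in ShExC (the dc:title property, the <BookShape> name, and the sample values are illustrative, not taken from the slides):

  PREFIX dc: <http://purl.org/dc/elements/1.1/>

  <BookShape> {
    dc:title LITERAL    # the value of dc:title must be a literal
  }

A node whose dc:title is the literal "Oliver Twist" conforms to <BookShape>; a node whose dc:title is an IRI such as <http://example.org/x> does not.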

If such validation before upload fails, don't upload the data. If the invalid data is already on the server, fix it there.

Further Shape examples from the slides are explained in detail.

First real example: a Shape schema that checks the consistency of diseases on Wikidata.

It checks whether an item is an instance of "disease".

That statement must be referenced, and the references themselves must follow a specific schema (certain properties must be present).

Other properties on the same item must be referenced as well.
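
A rough ShExC sketch of such a disease schema, assuming the standard Wikidata RDF prefixes and reference structure (the exact shapes shown in the talk may differ):

  PREFIX wd:   <http://www.wikidata.org/entity/>
  PREFIX wdt:  <http://www.wikidata.org/prop/direct/>
  PREFIX p:    <http://www.wikidata.org/prop/>
  PREFIX pr:   <http://www.wikidata.org/prop/reference/>
  PREFIX prov: <http://www.w3.org/ns/prov#>

  <DiseaseShape> {
    wdt:P31 [wd:Q12136] ;                      # instance of (P31): disease (Q12136)
    p:P31 {
      prov:wasDerivedFrom @<ReferenceShape>+   # the statement must carry at least one reference
    }
  }

  <ReferenceShape> {
    pr:P248 IRI                                # stated in (P248): the reference points to a source item
  }

Note how <DiseaseShape> refers to <ReferenceShape>; this is how schemas can include other shapes in their definition.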

There are three interchangeable syntaxes for writing ShEx: the compact syntax ShExC, which was used in the talk; the JSON representation ShExJ, which ShExC is converted to; and the RDF representation ShExR.
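
For illustration, the simple title schema above would look roughly like this in the JSON representation (ShExJ); the serialization details follow the ShEx 2.0 specification and are not taken from the slides:

  {
    "type": "Schema",
    "shapes": [
      {
        "type": "Shape",
        "id": "http://example.org/BookShape",
        "expression": {
          "type": "TripleConstraint",
          "predicate": "http://purl.org/dc/elements/1.1/title",
          "valueExpr": { "type": "NodeConstraint", "nodeKind": "literal" }
        }
      }
    ]
  }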

Open source implementations in many languages exist.

SHACL is similar to ShEx. It is implemented in a commercial IDE.

While ShEx is like XML Schema, SHACL is like Schematron.
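
For comparison, an illustrative SHACL shape expressing the same "title must be a literal" constraint (not from the talk; ex:Book is a made-up target class):

  @prefix sh: <http://www.w3.org/ns/shacl#> .
  @prefix dc: <http://purl.org/dc/elements/1.1/> .
  @prefix ex: <http://example.org/> .

  ex:BookShape
      a sh:NodeShape ;
      sh:targetClass ex:Book ;      # applies to all instances of ex:Book
      sh:property [
          sh:path dc:title ;
          sh:nodeKind sh:Literal ;  # the value of dc:title must be a literal
      ] .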

Questions / Answers

Q: What do you think Wikidata should do with it?

Experiment outside of Wikidata first. Wikidata should say what it wants.

Q: Should we maintain external mappings to deal with the inconsistencies in Wikidata, or enforce external schemas?

"Data modeling in Wikidata sucks." So don't do this in Wikidata, but outside, and feed back later what you learned.

Q: What is the difference between ShEx and ontology methods?

Ontologies describe the world; shape expressions describe the data.

Projects bridging between these exist.

Q: What's the status of Shape Expressions as a community group? Why do the slides say "2005–2010"?

It's an ongoing project.

The "2005" is from the slide framework. Sorry.

Q: How close is it to a recommendation?

ShEx lost; SHACL won and became the W3C Recommendation.

We might go for ISO.

Q: Are there converters between ShEx and SHACL?

It depends. There are tools such as Shaclex, but no 100% conversion is possible.