Wikidata:Events/Data Quality Days 2022/Modeling data

How do we deal with concurrent uses of different properties? The example of modeling data for humans by Epìdosis (Saturday, July 9th - 12:00 UTC)

Presentation

Note: the presentation has been updated a few times after the 9th July 2022: this was the version of the page presented at the Data Quality Days 2022.

Issue

Definition: a datum can sometimes be expressed in several ways, i.e. through different properties and/or qualifiers and/or a combination of properties and qualifiers.

Simplest patterns (some cases in #Examples are significantly more complex):

Pattern | Explanation
X:A vs Y:A | Same value, competing properties
X:A vs X:B | Same property, competing values
X:A vs Y:B | Competing property-value associations
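
As a rough illustration, the three patterns can be told apart mechanically once two statements are reduced to (property, value) pairs. This is only a sketch: the function name `classify_pattern` and the tuple representation are assumptions made for the example, not something used by any Wikidata tool.

```python
from typing import Optional, Tuple

# Hypothetical representation: (property id, value id), e.g. ("P106", "Q6498826")
Statement = Tuple[str, str]


def classify_pattern(first: Statement, second: Statement) -> Optional[str]:
    """Name the pattern that two competing statements fall into.

    Returns None when the two statements are identical (no conflict).
    """
    same_property = first[0] == second[0]
    same_value = first[1] == second[1]
    if same_property and same_value:
        return None
    if same_value:
        return "X:A vs Y:A"  # same value, different properties
    if same_property:
        return "X:A vs X:B"  # same property, different values
    return "X:A vs Y:B"      # both property and value differ
```

For instance, `classify_pattern(("P2868", "Q6498826"), ("P106", "Q6498826"))` returns `"X:A vs Y:A"`, which is the shape of the martyrs case listed in #Examples.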

Causes of the issue:

  • banal mistakes by newbies
  • insufficient clarity of how a property should be used (= unclear descriptions and too few constraints)
  • insufficient guidance by WikiProjects on how to model certain data
  • data imported from an external source which models some properties differently from Wikidata (e.g. overextension of P106 due to GND)

Consequences: this lack of uniformity makes queries and other data analyses significantly more difficult, and may cause some users (and data reusers) to miss some data, assuming they don't exist simply because they aren't modeled in the way those users expect.
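
The effect on queries can be reproduced on a toy dataset. The sketch below is purely illustrative: the item ids `person_1` and `person_2` and both helper functions are hypothetical, and the claims are a simplified stand-in for real items. It uses the two models for romanists listed in #Examples (P106 vs P101 combined).

```python
# Hypothetical mini-dataset: the same datum modeled in two competing ways.
ITEMS = {
    # occupation (P106) romanist (Q2504617)
    "person_1": {"P106": {"Q2504617"}},
    # occupation (P106) philologist (Q13418253) + field of work (P101) Romance studies (Q1277348)
    "person_2": {"P106": {"Q13418253"}, "P101": {"Q1277348"}},
}


def find_items(items, required):
    """Return the ids of items holding every (property, value) pair in `required`."""
    return sorted(
        item_id
        for item_id, claims in items.items()
        if all(value in claims.get(prop, set()) for prop, value in required)
    )


def find_items_any_model(items, models):
    """Union of the matches for several alternative models of the same datum."""
    found = set()
    for required in models:
        found.update(find_items(items, required))
    return sorted(found)


MODEL_1 = [("P106", "Q2504617")]
MODEL_2 = [("P106", "Q13418253"), ("P101", "Q1277348")]

only_first = find_items(ITEMS, MODEL_1)                  # misses person_2
both = find_items_any_model(ITEMS, [MODEL_1, MODEL_2])   # finds both people
```

A query written for the first model alone returns only `person_1`; a reuser has to know about every competing model to retrieve both.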

Examples

H in "type" marks examples regarding humans (instance of (P31) human (Q5)).

Type | Case | Discussion | {{Autofix}}able | Example options
- | P1480 vs P5102 in qualifiers | open | yes
  1. qualifier sourcing circumstances (P1480) circa (Q5727902)
  2. qualifier nature of statement (P5102) circa (Q5727902)
- | P31 vs P140 combined | not yet (I think) | no (maybe in more steps)
  1. instance of (P31) church building (Q16970) + religion or worldview (P140) Catholicism (Q1841)
  2. instance of (P31) Catholic church building (Q1088552) + religion or worldview (P140) Catholicism (Q1841) (a bit redundant)
  3. [instance of (P31) Catholic church building (Q1088552) (thwarts P140)]
H | P106 vs P140 combined | not yet | no (maybe in more steps)
  1. occupation (P106) theologian (Q1234713) + religion or worldview (P140) Catholicism (Q1841)
  2. occupation (P106) Catholic theologian (Q98833890) + religion or worldview (P140) Catholicism (Q1841) (a bit redundant)
  3. [occupation (P106) Catholic theologian (Q98833890) (thwarts P140)]
H | P106 vs P101 combined | not yet (I think) | no (maybe in more steps)
  1. occupation (P106) romanist (Q2504617)
  2. occupation (P106) philologist (Q13418253) + field of work (P101) Romance studies (Q1277348)
H | P39 vs P39 + P708 qualifier | not yet (but see this) | no!
  1. position held (P39) Roman Catholic bishop of Saint-Jean-de-Maurienne (Q22813013)
  2. position held (P39) diocesan bishop (Q1144278) with qualifier diocese (P708) Roman Catholic Diocese of Saint-Jean-de-Maurienne (Q360645)

Note: the first option would imply having an item "bishop of X" for each diocese X; in fact we have just a few of these items, which exist because of Wikipedia articles

H | P108 vs P39 combined | not yet (I think; but see the previous DQD) | no!
  1. employer (P108) Musée d'Orsay (Q23402) with qualifier position held (P39) museum director (Q22132694)
  2. position held (P39) museum director (Q22132694) with qualifier employer (P108) Musée d'Orsay (Q23402)
H | P69 vs P512 combined | not yet (I think; but see the previous DQD) | no!
  1. educated at (P69) University of Oxford (Q34433) with qualifier academic degree (P512) master (Q28046673)
  2. academic degree (P512) master (Q28046673) with qualifier educated at (P69) University of Oxford (Q34433)
H | martyrs | closed | yes
  1. subject has role (P2868) martyr (Q6498826) (approved)
  2. occupation (P106) martyr (Q6498826)
  3. instance of (P31) martyr (Q6498826)
H | P106 vs P611 combined | closed (seemingly) | yes
  1. religious order (P611) Order of Cistercians of the Strict Observance (Q276223) (approved)
  2. religious order (P611) Order of Cistercians of the Strict Observance (Q276223) + occupation (P106) Trappist Cistercian monk (Q99521081) (a bit redundant)
  3. [occupation (P106) Trappist Cistercian monk (Q99521081) (thwarts P611)]
H | P3831 vs P1013 vs P518 in qualifiers for P734 and P735 | closed | yes (but...)
  1. family name (P734) James (Q12188082) with qualifier object has role (P3831) maiden name (Q1376230) (approved)
  2. family name (P734) James (Q12188082) with qualifier criterion used (P1013) maiden name (Q1376230)
  3. family name (P734) James (Q12188082) with qualifier applies to part (P518) maiden name (Q1376230)

(Autofix still hasn't performed the P1013->P3831 fix after more than one month, for unclear reasons)

H | P106 vs P412 | evident | yes (but...)
  1. occupation (P106) opera singer (Q2865819) + voice type (P412) bass (Q27911) (correct)
  2. occupation (P106) bass (Q27911) + voice type (P412) bass (Q27911) (wrong, violates P106 value constraint; a bit redundant)
  3. [occupation (P106) bass (Q27911) (wrong, violates P106 value constraint; thwarts P412)]

Autofix supports single values, not classes of values. While all the main voice types are listed in Property talk:P106, listing every subtype (e.g. dramatic baritone (Q8243257)) one by one would be problematic, so keeping the constraint fully respected is in fact very difficult

H | P106 vs P108 | not yet (I think; but see the previous DQD) | ?
  1. employer (P108) University of Oxford (Q34433) with qualifier occupation (P106) teacher (Q37226)
  2. occupation (P106) teacher (Q37226) with qualifier employer (P108) University of Oxford (Q34433)
- | Dates yyyy-01-01 vs yyyy-00-00 | evident (see phab:T310981) | no of course! (but MatSuBot does)
  1. yyyy-01-01
  2. yyyy-00-00
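
For the dates case, a reuser can at least avoid treating the two encodings as different values. The sketch below assumes the Wikibase JSON representation of a time value, i.e. a dict with a `time` string and a numeric `precision`, where precision 9 means "year". The helper names are hypothetical, and the sketch does not say which of the two forms should be the standard.

```python
YEAR_PRECISION = 9  # Wikibase precision code for "year" (assumed input format)


def year_key(time_value: dict):
    """Reduce a year-precision time value to its signed year; None otherwise."""
    if time_value.get("precision") != YEAR_PRECISION:
        return None
    time = time_value["time"]  # e.g. "+1950-00-00T00:00:00Z"
    sign = -1 if time.startswith("-") else 1
    year = int(time.lstrip("+-").split("-", 1)[0])
    return sign * year


def same_year(a: dict, b: dict) -> bool:
    """True when both values have year precision and denote the same year."""
    key_a, key_b = year_key(a), year_key(b)
    return key_a is not None and key_a == key_b


jan_first = {"time": "+1950-01-01T00:00:00Z", "precision": 9}
zero_zero = {"time": "+1950-00-00T00:00:00Z", "precision": 9}
equivalent = same_year(jan_first, zero_zero)  # True: month and day are ignored
```

Values with month or day precision are deliberately left out of the comparison, since for them the month and day digits carry real information.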

Detect, decide, enforce

Three areas can be analysed:

  • detection process (= detect conflicting data models)
  • decision process (= decide a standard data model)
  • enforcement process (= enforce a standard data model)

Thoughts:

  1. detection process
    1. for the simplest cases, where different properties are merely used with the same value (e.g. martyrs above), querying all the properties having a certain value (e.g. this for rector (Q212071)) is sufficient to find anomalies
    2. often a certain value is used with the same (correct) property in 99% of cases and with other (wrong) ones only in very few cases, which can be fixed manually
  2. decision process
    1. if there is no clearly correct option for modeling a datum, a discussion should be opened to build consensus for a standard data model
    2. the discussion is usually opened at the competent WikiProject; pinging also Wikidata:WikiProject Data Quality could be useful
    3. obstacle: sometimes, despite {{Ping project}}, these discussions go unnoticed by many users and receive very few comments; this significantly slows the process of reaching standard data models (possible palliatives: reporting the discussion at Wikidata:Status updates/Next; pinging the individual users who have used the involved properties most frequently, finding them through NavelGazer)
    4. sometimes it can be useful to boldly close the discussion, even with only a few comments, and enforce the standard data model through {{Autofix}} (if it is usable); the edits can draw users' attention and revive the discussion, leading either to the confirmation of the enforced data model or to the definition and enforcement of a different one
  3. enforcement process
    1. always try to enforce the standard data model extensively (if newbies find somewhere the non-standard solution, they could reproduce it)
    2. if the items involved are few (i.e. a few tens), editing them manually is probably the best solution
    3. if the items involved are many, first consider using {{Autofix}}, and only then Wikidata:Bot request
    4. Autofix has the advantages of being easy to set up and of running periodically (although the periodicity is unclear), while Bot requests require a bot programmer and usually run once
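
The detection step can be sketched as follows. This is a minimal illustration, not the query linked above: `properties_using_value_query` only builds a SPARQL string (relying on the `wd:` and `wikibase:` prefixes being predefined, as they are on the Wikidata Query Service), and `minority_properties` applies the "99% of cases" observation to counts that the caller is assumed to have obtained by running that query. Both function names and the 1% threshold are assumptions.

```python
from typing import Dict, List


def properties_using_value_query(value_qid: str) -> str:
    """Build a SPARQL query counting, per property, the items that use
    `value_qid` as a main value."""
    if not (value_qid.startswith("Q") and value_qid[1:].isdigit()):
        raise ValueError(f"not a QID: {value_qid!r}")
    return (
        "SELECT ?property (COUNT(?item) AS ?uses) WHERE {\n"
        f"  ?item ?direct wd:{value_qid} .\n"
        "  ?property wikibase:directClaim ?direct .\n"
        "}\n"
        "GROUP BY ?property\n"
        "ORDER BY DESC(?uses)"
    )


def minority_properties(uses: Dict[str, int], threshold: float = 0.01) -> List[str]:
    """Return the properties whose share of all uses is at most `threshold`;
    these are candidates for manual review."""
    total = sum(uses.values())
    if total == 0:
        return []
    return sorted(prop for prop, count in uses.items() if count / total <= threshold)


query = properties_using_value_query("Q212071")  # rector
# Hypothetical counts, for illustration only:
suspects = minority_properties({"P39": 995, "P106": 5})
```

With these made-up counts, `suspects` is `["P106"]`. A rare property is only a candidate for review: deciding which property is actually the correct one still belongs to the decision process.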

Main enforcement methods

Possibility | Patterns fixed | Patterns not fixed | Pros | Cons
(manual intervention) | potentially everything | nothing | fixes items directly; versatile | slow; waste of human resources
property constraints | X:A (value, not class of values)->X:B; X:A->Y:A; X:A->Y:B | combinations of properties and qualifiers | generates constraint violations that can educate users manually editing items and that can be monitored through queries and reports | unable to fix items directly
{{Autofix}} | X:A (value, not class of values)->X:B; X:A->Y:A; X:A->Y:B | combinations of properties and qualifiers | fixes items directly | limited range of fixed patterns; edits performed with not-fully-clear periodicity; the fixes presently running cannot be queried in SPARQL; setting new fixes requires some understanding of coding; etc.
DeltaBot fixClaims | X:A (value, not class of values)->X:B; X:A->Y:A; X:A->Y:B; others for qualifiers and references | combinations of properties and qualifiers | fixes items directly | more documentation needed about the types of "action"; the fixes presently running cannot be queried in SPARQL; setting new fixes requires some understanding of coding
bot fix | potentially everything | nothing | fixes items directly | requires programming skills (or finding someone with programming skills) and a long time (bot request + find a bot programmer + obtain approval of bot task)
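
To make the X:A->X:B, X:A->Y:A and X:A->Y:B patterns concrete, here is a minimal sketch of such a substitution on a simplified in-memory item. It is not how {{Autofix}} or DeltaBot are implemented: `FixRule`, `apply_rule` and the `Claims` shape are hypothetical, and qualifiers, references and ranks are ignored.

```python
from dataclasses import dataclass
from typing import Dict, Set

# Hypothetical simplified item: property id -> set of value ids
Claims = Dict[str, Set[str]]


@dataclass(frozen=True)
class FixRule:
    """Replace from_property:from_value with to_property:to_value."""
    from_property: str
    from_value: str
    to_property: str
    to_value: str


def apply_rule(claims: Claims, rule: FixRule) -> bool:
    """Apply one rule in place; return True if the item was changed."""
    values = claims.get(rule.from_property, set())
    if rule.from_value not in values:
        return False
    values.discard(rule.from_value)
    if not values:
        # Drop the property entirely once its last value is gone.
        del claims[rule.from_property]
    claims.setdefault(rule.to_property, set()).add(rule.to_value)
    return True


# Martyrs case (X:A->Y:A): occupation (P106) martyr -> subject has role (P2868) martyr
martyr_rule = FixRule("P106", "Q6498826", "P2868", "Q6498826")
item = {"P106": {"Q6498826", "Q1234713"}}
changed = apply_rule(item, martyr_rule)
```

After the call, `item` holds theologian (Q1234713) under P106 and martyr (Q6498826) under P2868. Because each rule names one exact value, a class of values (such as every voice type) would need one rule per value.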

Final suggestions

  1. reflect on better ways to detect conflicting data models more extensively (e.g. monitoring items used as values by similar properties)
  2. increase the number of subscribers to WikiProjects and particularly to WikiProject Data Quality (although this may conflict, at a certain point, with the 50-users limit of {{Ping project}} ...), in order to increase the number of potential participants in discussions regarding standard data models
  3. enable users to perform (without resorting to bot requests) some frequent types of substitutions presently not supported by {{Autofix}}: Autofix supports X:A->X:B and X:A->Y:A and X:A->Y:B (where A and B are single values, not classes of values); I propose to support X:C->Y:C (with C being all the values resulting from a query; this would solve e.g. P106 vs P412 above) and X:A qualified by Y:B->Y:B qualified by X:A (this would solve e.g. P108 vs P39 combined and P69 vs P512 combined, once a standard is established)
  4. as of now, using property constraint (P2302) none-of constraint (Q52558054) with qualifiers replacement property (P6824) and replacement value (P9729) only affects constraint violations, but does not imply a periodic fix on items; to obtain periodic fixes on items, all these constraints have to be replicated on property talk pages as {{Autofix}}. Would it be possible to avoid this duplication by having a bot run periodically on the basis of the constraints? As of now, some standards are stated as constraints, others as Autofix and others as both constraints and Autofix (we lack a standard data model for standard data models!): storing standards only as constraints would make them queryable, whereas Autofix cannot be queried
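
The two substitutions proposed in point 3 could look roughly like this on simplified in-memory data. This is only a sketch of the idea, not an existing {{Autofix}} feature: both function names and both data shapes are hypothetical, and the direction of the qualifier swap is purely illustrative, since no standard has been established for those cases.

```python
from typing import Dict, Set


def move_class_of_values(claims: Dict[str, Set[str]], from_property: str,
                         to_property: str, class_values: Set[str]) -> bool:
    """X:C->Y:C: move every value of `from_property` that belongs to
    `class_values` (assumed to come from a query) to `to_property`."""
    moved = claims.get(from_property, set()) & class_values
    if not moved:
        return False
    claims[from_property] -= moved
    if not claims[from_property]:
        del claims[from_property]
    claims.setdefault(to_property, set()).update(moved)
    return True


def swap_main_and_qualifier(statement: dict, qualifier_property: str) -> dict:
    """X:A qualified by Y:B -> Y:B qualified by X:A.

    `statement` is {"property": ..., "value": ..., "qualifiers": {prop: [values]}}.
    The statement is returned unchanged when the qualifier is absent or has
    several values, since the swap would then be ambiguous.
    """
    qualifiers = statement.get("qualifiers", {})
    values = qualifiers.get(qualifier_property, [])
    if len(values) != 1:
        return statement
    remaining = {p: list(v) for p, v in qualifiers.items() if p != qualifier_property}
    remaining.setdefault(statement["property"], []).append(statement["value"])
    return {"property": qualifier_property, "value": values[0], "qualifiers": remaining}


# P106 vs P412: the class here is just two voice types taken from the examples above.
voice_types = {"Q27911", "Q8243257"}
singer = {"P106": {"Q27911", "Q2865819"}}
move_class_of_values(singer, "P106", "P412", voice_types)

# P108 vs P39 combined: employer qualified by position held, swapped.
director = {"property": "P108", "value": "Q23402", "qualifiers": {"P39": ["Q22132694"]}}
swapped = swap_main_and_qualifier(director, "P39")
```

After these calls, `singer` has opera singer (Q2865819) under P106 and bass (Q27911) under P412, and `swapped` is a P39 statement with value Q22132694 qualified by P108 Q23402.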

Notes of the session

👥 Number of participants (including speakers): 19

🖊️ Notes & links

  • Presentation: https://www.wikidata.org/wiki/Wikidata:Events/Data_Quality_Days_2022/Modeling_data
  • A datum can sometimes be expressed in several ways, i.e. through different properties and/or qualifiers and/or a combination of properties and qualifiers. Causes of the issue:
    • banal mistakes by newbies
    • insufficient clarity of how a property should be used (= unclear descriptions and too few constraints)
    • insufficient guidance by WikiProjects on how to model certain data
    • data imported from an external source which models some properties differently from Wikidata (e.g. overextension of P106 due to GND)
  • Consequences: this lack of uniformity makes queries and other data analyses significantly more difficult, and may cause some users (and data reusers) to miss some data, assuming they don't exist simply because they aren't modeled in the way those users expect.
  • Examples of modeling data: https://www.wikidata.org/wiki/Wikidata:Events/Data_Quality_Days_2022/Modeling_data#Examples
    • One example is occupation (P106) vs. religion or worldview (P140)
      • occupation (P106) theologian (Q1234713) + religion or worldview (P140) Catholicism (Q1841)
      • occupation (P106) Catholic theologian (Q98833890) + religion or worldview (P140) Catholicism (Q1841) (a bit redundant)
      • [ occupation (P106) Catholic theologian (Q98833890) (thwarts P140)]
      • If someone writes a query that searches for "occupation" = "theologian" and "religion or worldview" = "Catholicism", the results won't include the items that have "occupation" = "Catholic theologian" and "religion or worldview" = "Catholicism".
      • Note: We have Catholic theologian (Q98833890) as a profession, because German and Czech National Libraries have it as a profession, so we imported those data
    • Another example: occupation (P106) vs field of work (P101) 
      • occupation (P106) romanist (Q2504617)
      • occupation (P106) philologist (Q13418253) + field of work (P101) Romance studies (Q1277348)
      • Again: a query written for the first model will miss all results modeled in the second way
    • And so on... there are many cases in which this kind of "conflict" between styles of modelling humans affects queries, and therefore also displaying, cleaning up and maintaining data
    • The problem of yyyy-01-01 and yyyy-00-00. Discussion on https://phabricator.wikimedia.org/T310981
    • 3 phases to face the explained issues: detect, decide and enforce. Each phase is described in https://www.wikidata.org/wiki/Wikidata:Events/Data_Quality_Days_2022/Modeling_data#Detect,_decide,_enforce
      • Detection is the "easiest" part
      • Discussions usually happen at WikiProjects, but unfortunately they can be very long
      • It is possible to reach a standard, but then you have to enforce it, which slows the whole process further and may take a long time
      • Three automated versions to enforce standards: property constraints, Autofix, and bot fixes
    • https://www.wikidata.org/wiki/Wikidata:Events/Data_Quality_Days_2022/Modeling_data#Final_suggestions


❓ Questions and discussions

  • How is the bishop example different from the museum director example? 😒 Why should one be handled one way and the other the other way? 
    • With bishops we have proper items for dioceses; in other cases we don't. We can decide never to use the specific items for bishops, or to create the specific items for them. In the case of museum directors, we need to qualify Director with employer:museum, but we won't create an item for Director of Museum X. We just need to find the correct property and the correct qualifier.
    • We should find the solution that avoids creating unnecessary items - we can also get rid of overly specific items and switch to the property+qualifier solution; but we need to decide whether we need the position as main value or as qualifier.
  • What's the best place to discuss these issues with the broader community, and make decisions? And then how can we enforce them? 
    • The best place to discuss it is possibly the competent WikiProject, with a ping to WikiProject Data Quality, and then also a notice on the talk pages of the properties involved. 
  • [Comment from Nikki] getting bots to do things properly can be tricky, so I don't think automating all fixes would make sense, but automating easy but tedious fixes so that people don't have to do them does make sense
  • [Comment from Jan] I think we need better definitions of how "occupation" is different from "position held". 
  • Luca: could some of these problems be solved through entity schemas?
  • Rodrigo: I'm not very familiar with writing ShEx, but I think ShEx could also be useful in the enforcement process, have you considered this as a tool?
    • I've taken a look at entity schemas, they're used in a very few cases on humans, but they could help in this
    • [Luca] I'm not an expert on Schemas, but I've tried my hand at them. They can help define a schema for describing items - but we're back at step 1: we need to decide how we want to describe a human, or an administrative entity, or anything, and then we write it in ShEx, so that it can be used to enforce that editorial decision
    • Note: ShEx is the language of EntitySchema
  • Jan: we have specific items anyway (mayor of london, etc.); either we decide to always use them or to never use them, but something in between is not a good idea.
    • Epidosis: deciding not to use [these specific items with position held] could be a viable option, we would just need to decide and enforce it
    • [Nikki] yeah I think we have a bunch of items that exist because they have sitelinks that we wouldn't otherwise have, and we should discourage using those
    • [Luca] +1 on Jan
    • [VIGNERON] +1 on Jan but I would prefer the "never" option ;)
    • [Jan] It is useful though, because it's easy to query for and enables templates like PositionHolderHistory e.g. like on https://www.wikidata.org/wiki/Talk:Q3315958
      • [VIGNERON] I agree it's easier, but too easy. And tools should be able to get the same result with qualified generic items, don't model data based on how limited the tools are ;)
  • [Manuel] this could be a good start to fix things, how can we do it?
    • It's complex... we need a discussion, to be addressed in several parts or issues, and afterwards we need ways to enforce them. I didn't say in my presentation that Autofix periodically fixes a problem, while a bot intervention is just a one-off solution, but the problem might come back in the future
    • [Manuel] It's easy to find a solution for a single problem, but generalising is another thing... but this is the necessary next step, and we need to take it as a project, but it's really really difficult. I would be really interested in someone's idea about this.
    • [Camillo] Improving Autofix should be a priority. If we extend Autofix to fix more cases, it would solve more minor things. Autofix unfortunately is not queryable. Or can we use the constraints? If the constraints could auto-enforce themselves, without having to duplicate them in the talk pages as {{autofix}}, this solution would be more effective from many points of view: constraints are queryable; some users would probably find it easier to add constraints than to add {{autofix}} in talk pages; we would have just one single way to store "rules" instead of two (constraints + {{autofix}}). Probably we can involve more users this way.
      • [Lydia] https://www.wikidata.org/wiki/Special:Contributions/KrBot seems to be performing edits regularly based on the autofix template 
      • Yeah it seems to do so, but in some cases it doesn't do the job that was requested; also, the code is not public, and the bot owner refused to publish it
      • Massively enforcing a standard will have effects, and will be very difficult. Instead of deciding and enforcing standards, I would just do some other task instead
    • If property constraints don't enforce themselves, and you need to rely on external tools to enforce them, then we're back at square one - and people will keep losing interest in enforcing rules, constraints, models and the like

🎯 Key takeaways and outcomes

  • ...
  • ...

☑️ Next steps

  • ...
  • ...