Wikidata:Events/Data Modelling Days 2023/AlternateReferenceModel

✨---------------✨---------------✨---------------✨---------------✨---------------

Alternate reference model(s)
ArthurPSmith

✨---------------✨---------------✨---------------✨---------------✨---------------

👥 Number of participants (including speakers):
44 (at 14:02 UTC)

🖊️ Notes & links

Slides are available at https://commons.wikimedia.org/wiki/File:Wikidata_Alternate_reference_model.pdf

❓ Questions and discussions
James Heald writes: I think the data structure used by the query service already does this (i.e. combines references into a single stored entity if their content is the same). But the saving in practice becomes less than it might be if the references start getting "decorated" with additional information specific to a particular statement (e.g. things like "stated in reference as" or "reference supports qualifier" etc.). Does your proposal consider this?
Arthur: It hadn't considered it; however the property approach (option 3) would allow for this in the form "see reference 1, reference supports qualifier x" which could be an advantage of that approach.
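Editor's sketch of the point James raises: identical references can be pooled under one content hash, but a reference "decorated" with statement-specific information no longer matches and is stored separately. The dictionary layout, the `reference_hash` and `deduplicate` helpers, and the plain-text keys are assumptions made for illustration; the hash here is not the hashing scheme Wikibase or the query service actually uses.

```python
import hashlib
import json


def reference_hash(reference: dict) -> str:
    """Illustrative content hash over canonical JSON (not Wikibase's real scheme)."""
    canonical = json.dumps(reference, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def deduplicate(statements: list) -> tuple:
    """Return (pool of unique references keyed by hash, statements pointing at hashes)."""
    pool = {}
    compact = []
    for statement in statements:
        hashes = []
        for reference in statement["references"]:
            digest = reference_hash(reference)
            pool.setdefault(digest, reference)  # store each distinct reference once
            hashes.append(digest)
        compact.append({
            "property": statement["property"],
            "value": statement["value"],
            "reference_hashes": hashes,
        })
    return pool, compact


# Hypothetical data: "stated in X", "id Y", "retrieved date Z", as in the discussion.
plain = {"stated in": "X", "id": "Y", "retrieved": "Z"}
decorated = dict(plain, **{"stated in reference as": "some label"})

statements = [
    {"property": "example-property-1", "value": "a", "references": [plain]},
    {"property": "example-property-2", "value": "b", "references": [dict(plain)]},
    {"property": "example-property-3", "value": "c", "references": [decorated]},
]

pool, compact = deduplicate(statements)
print(len(pool))  # two stored references for three statements
```

The first two statements share one pooled reference; the decorated one gets its own entry, which is the reduced saving James describes.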
Sky writes: The example we're looking at is more an issue of overly complicated documentation of references. It's important to maintain an ability to have individual references on each statement because they all may derive from a different provenance trace. Part of why the current model is so powerful is that we can run SPARQL queries for very diverse information coming from all across Wikidata and use prov:wasDerivedFrom to get at the details on claim source. We should have more references pointing to other items in Wikidata where some of the complexity in describing provenance can be better broken out in the graph as a whole.
Arthur: sometimes it makes sense to create separate items for references, but a lot of these don't - for example the duplicated references are often "stated in X", "id Y", "retrieved date Z" and it really doesn't make much sense to add a separate item for each Y and Z when we already have the item X.
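Editor's sketch of the kind of query Sky refers to: following prov:wasDerivedFrom from a statement to its reference nodes. The helper name `reference_query` is hypothetical, and Q42 / P569 are only example IDs. The function builds the query text and does not contact the query service.

```python
import re

ITEM_ID = re.compile(r"Q[1-9][0-9]*")
PROPERTY_ID = re.compile(r"P[1-9][0-9]*")


def reference_query(item_id: str, property_id: str) -> str:
    """Build a SPARQL query listing the references on one item's statements."""
    # Validate IDs so arbitrary text cannot be spliced into the query.
    if not ITEM_ID.fullmatch(item_id):
        raise ValueError(f"not an item ID: {item_id!r}")
    if not PROPERTY_ID.fullmatch(property_id):
        raise ValueError(f"not a property ID: {property_id!r}")
    return f"""PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pr: <http://www.wikidata.org/prop/reference/>
PREFIX prov: <http://www.w3.org/ns/prov#>

SELECT ?statement ?value ?ref ?source WHERE {{
  wd:{item_id} p:{property_id} ?statement .
  ?statement ps:{property_id} ?value ;
             prov:wasDerivedFrom ?ref .
  OPTIONAL {{ ?ref pr:P248 ?source . }}  # P248 = "stated in"
}}"""


query = reference_query("Q42", "P569")
print(query)
```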
Andy: I have a "wish" for a tool to bulk replace long-hand citations with items. See: https://www.wikidata.org/w/index.php?title=Wikidata%3AProject_chat&type=revision&diff=1490103764&oldid=1489990994 - example uses 20 edits, or over 80 mouse clicks, in all! We just need someone to code it!
Nikki: I don't think the UI would *need* to change, that seems independent of the underlying storage
Arthur: agreed, the more compact storage format could just be translated to the current format for display/API/dump use so nothing on the user end need change.
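Editor's sketch of the translation step Arthur describes: a compact stored form (references kept once, statements pointing to them by ID) is expanded back into the familiar one-copy-per-statement form before display, API output or dumps. The field names (`reference_ids`, `references`) and the `expand` helper are assumptions for illustration, not an existing Wikibase format.

```python
import copy


def expand(pool: dict, compact_statements: list) -> list:
    """Rebuild per-statement references from a shared pool of references."""
    expanded = []
    for statement in compact_statements:
        # Copy everything except the internal pointer list.
        full = {k: v for k, v in statement.items() if k != "reference_ids"}
        # Deep-copy so each statement gets an independent reference, as today.
        full["references"] = [copy.deepcopy(pool[ref_id])
                              for ref_id in statement["reference_ids"]]
        expanded.append(full)
    return expanded


# Hypothetical compact storage: one reference, used by two statements.
pool = {"ref1": {"stated in": "X", "id": "Y", "retrieved": "Z"}}
compact_statements = [
    {"property": "example-property-1", "value": "a", "reference_ids": ["ref1"]},
    {"property": "example-property-2", "value": "b", "reference_ids": ["ref1"]},
]

expanded = expand(pool, compact_statements)
print(expanded[0]["references"])
```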
Eihel: reference property would have the advantage of adding it almost by default to all Items
Nikki: I'm not keen on storing references as qualifiers of another statement, that makes the underlying data model kinda pointless, since references are now mainsnaks and qualifiers, not references
TuukkaH: Solution 4: compress the JSON before storing it or calculating its size?
Arthur: i.e. gzip in MySQL storage? There may be some implications for MediaWiki search, etc.
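Editor's sketch of TuukkaH's "Solution 4": JSON with many repeated references is highly repetitive, so general-purpose compression should shrink it considerably. The item layout below is made up for illustration, and this says nothing about how MediaWiki actually stores or limits page content.

```python
import gzip
import json

# Hypothetical item: 200 statements that all carry the same reference.
reference = {"stated in": "X", "id": "Y", "retrieved": "Z"}
item = {
    "statements": [
        {"property": f"example-property-{i}", "value": i, "references": [reference]}
        for i in range(200)
    ]
}

raw = json.dumps(item).encode("utf-8")
compressed = gzip.compress(raw)

# Decompressing gives back exactly the same data, so nothing is lost.
restored = json.loads(gzip.decompress(compressed).decode("utf-8"))

print(len(raw), len(compressed))
```

Whether a size limit would be applied before or after compression is the open question in TuukkaH's comment; the sketch only shows that the two sizes differ.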
Nicolas Vigneron: I don't much like option 3, but not every complexity can be on the other entity 😉
Nikki: it already reuses hashes (as far as I know) if the references are the same, and we could have a way to edit all the references that are the same on the item without changing the data model at all
Arthur: a user script or gadget to do this would be nice - that seems similar to what Andy is requesting also.
Camillo: +1 on Nikki about editing with one edit all references which are the same
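Editor's sketch of the lookup a user script or gadget would need: find which statements on an item share the same reference hash, so that one edit could be applied to all of them. The entity below is a heavily simplified version of the entity JSON (`claims`, `references`, `hash`), with made-up statement IDs and hashes; actually saving the edits is left out.

```python
from collections import defaultdict


def statements_by_reference_hash(entity: dict) -> dict:
    """Map each reference hash to the IDs of the statements that use it."""
    index = defaultdict(list)
    for statements in entity.get("claims", {}).values():
        for statement in statements:
            for reference in statement.get("references", []):
                index[reference["hash"]].append(statement["id"])
    return dict(index)


# Made-up example entity; hashes and statement IDs are placeholders.
entity = {
    "claims": {
        "P569": [{"id": "Q42$stmt-1", "references": [{"hash": "aaa111"}]}],
        "P19": [{"id": "Q42$stmt-2", "references": [{"hash": "aaa111"}]}],
        "P20": [{"id": "Q42$stmt-3", "references": [{"hash": "bbb222"}]}],
    }
}

index = statements_by_reference_hash(entity)
# References used on more than one statement are the candidates for a bulk edit.
shared = {h: ids for h, ids in index.items() if len(ids) > 1}
print(shared)
```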
TuukkaH: Does anyone know if the new Wikidata REST API already has a different format for the references?
Ollie (WMDE): https://doc.wikimedia.org/Wikibase/master/php/rest_data_format_differences.html - it's different but still has a hash, and repeated on each statement
Ollie (WMDE): One thing I should have mentioned about the new REST API JSON format: it does not change the format of the JSON that is stored in the database, it is just a different presentation format.
Nikki: I only really see two things we're asking the developers to do: make it possible to edit large items, in whatever way they're willing to implement (whether that's storing it differently, compressing it, increasing the max page size, etc.), and make it easier to edit the same reference used multiple times
TuukkaH: @nikki, it's also an API question if the size of the API responses and dumps matters
Lydia Pintscher: What might be useful as a next step is creating a list of the different underlying problems that we're trying to solve here. That could be an RfC or Phabricator or something else
Nikki: a discussion on project chat would be ok for asking editors which problems they're having with large items and repeated references

🎯 Key takeaways and outcomes
...
...

☑️ Next steps
Poll: option 2 strongly preferred (developer work needed)

Which solution do you prefer?
By Lydia Pintscher
Solution 1: 3 (17%)
Solution 2: 14 (78%)
Solution 3: 4 (22%)

What's the next step?
Project chat: 4 (31%)
RfC: 6 (46%)
Property Proposal: 3 (23%)
other: 3 (23%)