Wikidata:Requests for permissions/Bot/Soweego bot 4
The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
- Approved --Ymblanter (talk) 16:02, 5 September 2021 (UTC)
Soweego bot 4
Soweego bot (talk • contribs • new items • new lexemes • SUL • Block log • User rights log • User rights • xtools)
Operator: Hjfocs (talk • contribs • logs)
Task/s: metadata-based validation of existing items, as part of the m:Grants:Project/Hjfocs/soweego_2 Wikimedia Project Grant. This bot partners with Reinheitsgebot, operated by Magnus Manske under the mix'n'match umbrella.
Function details:
Given a target catalog such as MusicBrainz (Q14005), run metadata-based validation over identifier statements that already exist in Wikidata and perform 2 types of edit (a third type, listed last, was later withdrawn; see UPDATE 2):
- add or reference statements based on available metadata in the target catalog entry;
- rank as preferred identifier statements that share all metadata with the target catalog entry;
- (withdrawn) rank as deprecated identifier statements that do not share any metadata with the target catalog entry.
Toy example: Wikidata states that Elvis Presley was born on January 8, 1935 in Tupelo, while MusicBrainz states that he was born in 1934 in Memphis. Action = add 2 referenced statements with MusicBrainz values to the Elvis Presley item.
Type 1 edit example: subject item = Lee Grant (Q230184); property = date of birth; action = add value.
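The first edit type (add or reference) can be sketched in a few lines. This is only an illustration under stated assumptions, not soweego's actual code: the function name `plan_type1_edits`, the `Claims` mapping, and the plain-string values are hypothetical, and values are compared by exact equality. A catalog value that Wikidata already holds gets a reference; any other catalog value becomes a new referenced statement.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical representation: property label -> set of values as strings.
Claims = Dict[str, Set[str]]
Edit = Tuple[str, str]  # (property, value)


def plan_type1_edits(wikidata: Claims, catalog: Claims) -> Tuple[List[Edit], List[Edit]]:
    """Return (to_add, to_reference) for one item/catalog entry pair."""
    to_add: List[Edit] = []
    to_reference: List[Edit] = []
    for prop, values in sorted(catalog.items()):
        existing = wikidata.get(prop, set())
        for value in sorted(values):
            if value in existing:
                # Wikidata already states this value: only add a reference.
                to_reference.append((prop, value))
            else:
                # Value is missing in Wikidata: add a referenced statement.
                to_add.append((prop, value))
    return to_add, to_reference


# The toy example from the request (Elvis Presley).
wikidata = {"date of birth": {"1935-01-08"}, "place of birth": {"Tupelo"}}
catalog = {"date of birth": {"1934"}, "place of birth": {"Memphis"}}
to_add, to_reference = plan_type1_edits(wikidata, catalog)
print(to_add)        # [('date of birth', '1934'), ('place of birth', 'Memphis')]
print(to_reference)  # []
```

With the toy data, both MusicBrainz values differ from Wikidata, so the plan contains 2 statements to add and none to reference, matching the action described above.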
Test edits: [1]
Related pages:
- proposal: m:Grants:Project/Hjfocs/soweego_2#How:_the_solution, see criterion 3;
- chat: Wikidata:Project_chat/Archive/2021/07#Item_validation_criteria.
--Hjfocs (talk) 15:35, 6 August 2021 (UTC)
- Oppose 1) Referring to this test edit, the bot is importing a year-precision date where the source, IMDb, has a day-precision date. 2) The item already has an IMDb-sourced day-precision date statement, which the bot ignored: the item now has two IMDb date-of-birth statements whose values differ. 3) It's not clear why we would want to rank as preferred identifier statements that share all metadata with the target catalog entry; the catalog may well have duplicates; other ID statements may be equally valid, but the bot is not doing anything to examine them in relation to its decision to promote one of the IDs. Nor is it clear what "share all metadata with the target catalog entry" means. The test data set does not demonstrate this part of the functionality, so we are none the wiser. 4) For much the same reason, it is not clear why we would want to rank as deprecated identifier statements that do not share any metadata with the target catalog entry. If the catalog has duplicates, do we deprecate IDs simply because they are not the same as whichever other ID the bot happens to be looking at? Again, the test data set does not demonstrate this function. --Tagishsimon (talk) 18:00, 6 August 2021 (UTC)
- Thanks for the constructive criticism, let me address your concerns:
- 1) I acknowledge that the IMDb website displays a day-precision date for the test edit you mentioned, see [2]. Unfortunately, soweego consumes the available IMDb datasets, which do not seem to offer the same precision, and Web scraping IMDb is out of scope;
- 2) I fully agree that when a given catalog supplies a less precise value, the Wikidata one should take priority, and the bot should not add the statement. I have opened an issue, I propose to revert the relevant test edits, and the same solution will also tackle point 1;
- 3 & 4) the motivation stems from the following intuition: if Wikidata and a given catalog completely agree on an item, then we should mark the link between the two records as confident, and vice versa if there's full disagreement. This can be implemented by preferring or deprecating the rank of the corresponding Wikidata identifier statement. The solution can also disambiguate potential catalog duplicates that are actually homonyms (read: different entities): for instance, suppose Elvis Presley (Q303) has 2 MusicBrainz identifiers, one pointing to the singer, and another pointing to a tribute band. These two records are likely to have different metadata, such as birth/inception dates. By comparing each of the two MusicBrainz records with Wikidata, we would end up validating the correct record (the singer), thus preferring it over the wrong one.
- I'd like to add that in case of actual catalog duplicates, we already have something in mind, i.e., [3], [4], and [5].
- With respect to other identifier statements linking to different catalogs, I don't think that they would be impacted (at least from a mere knowledge base querying perspective), or am I missing some use case?
- Finally, when you argue that the test edits do not demonstrate the metadata sharing, please bear in mind that they reflect the final output: if you are interested in which specific pieces of data are shared, I can provide the system debug-level logs. Or perhaps you think it would make sense to explicitly add a reference to every Wikidata statement that matches a piece of data in the catalog entry? This could be done in principle.
- Hope this addresses the feedback. Cheers --Hjfocs (talk) 10:03, 9 August 2021 (UTC)
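The precision rule agreed on in point 2 (do not add a catalog date when Wikidata already holds a compatible, more precise one) could look roughly like the sketch below. It is an assumption-laden illustration, not the fix tracked in the issue: the `Date` tuple, `precision`, `agrees_up_to`, and `should_add` are hypothetical names, and the dates in the usage lines are made up.

```python
from typing import List, Optional, Tuple

# Hypothetical date model: (year, month, day); None marks an unknown component.
Date = Tuple[int, Optional[int], Optional[int]]


def precision(date: Date) -> int:
    """Count known components: 1 = year, 2 = month, 3 = day."""
    return sum(part is not None for part in date)


def agrees_up_to(coarse: Date, fine: Date) -> bool:
    """True when every known component of `coarse` equals the one in `fine`."""
    return all(c is None or c == f for c, f in zip(coarse, fine))


def should_add(catalog_date: Date, wikidata_dates: List[Date]) -> bool:
    """Skip the catalog date if Wikidata has a compatible value that is at least as precise."""
    for existing in wikidata_dates:
        if precision(existing) >= precision(catalog_date) and agrees_up_to(catalog_date, existing):
            return False
    return True


# Made-up values: a year-only catalog date against a day-precision Wikidata date.
print(should_add((1950, None, None), [(1950, 5, 17)]))  # False: Wikidata is more precise
print(should_add((1950, None, None), [(1951, 5, 17)]))  # True: the years conflict
```

A conflicting value is still added, as in the toy example from the request, so that the disagreement between the two sources stays visible.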
- I think when ranking preferred you should be considering the precision of the dates. We don't want a year precision date to be preferred over a day precision if both are well cited and agree. BrokenSegue (talk) 20:51, 6 August 2021 (UTC)
- The ranking elements appear to apply to identifier statements, which I take to be MusicBrainz ID / IMDb ID rather than dates. --Tagishsimon (talk) 20:53, 6 August 2021 (UTC)
- Yes, that is correct. Cheers --Hjfocs (talk) 10:03, 9 August 2021 (UTC)
[UPDATE 1] Additional test edits: we integrated Tagishsimon's relevant feedback, and propose 2 new batches of test edits.
- In [6] we make it explicit, by means of a reference, that statements are shared with a given target catalog (here MusicBrainz (Q14005)). This exposes a hidden functionality of the validation process; moreover, we believe it can improve data quality by adding references;
- in [7] we show further biographical statements involved in the validation process, besides dates, i.e., place of birth (P19), place of death (P20), and sex or gender (P21).
- Minor comments:
- we manually fixed (Hjfocs did) a few problematic test edits.
- due to a small bug, statements that should have sex or gender (P21) = female (Q6581072) have female organism (Q43445) instead;
- bad place of birth (P19) value in [8].
[UPDATE 2] Don't deprecate: we realized that not sharing any biographical data is not a sufficient reason for deprecating the identifier statement: there may be other kinds of shared data, typically URLs. Therefore, we removed the third type of edit from this request.
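After this update, the ranking decision for an identifier statement has only two outcomes. The sketch below is a hypothetical reading of "share all metadata": every property in the catalog entry must have at least one value in common with Wikidata. The function name `decide_rank`, the `Claims` mapping, and the tribute-band values are assumptions for illustration, not soweego's implementation.

```python
from typing import Dict, Set

# Hypothetical representation: property label -> set of values as strings.
Claims = Dict[str, Set[str]]


def decide_rank(wikidata: Claims, catalog: Claims) -> str:
    """Return 'preferred' on full agreement, otherwise 'unchanged' (never deprecate)."""
    shares_all = bool(catalog) and all(
        catalog[prop] & wikidata.get(prop, set()) for prop in catalog
    )
    return "preferred" if shares_all else "unchanged"


# Homonym scenario from the discussion: two catalog records linked from one item.
wikidata = {"date of birth": {"1935-01-08"}, "place of birth": {"Tupelo"}}
singer = {"date of birth": {"1935-01-08"}, "place of birth": {"Tupelo"}}
tribute_band = {"date of birth": {"1990"}}  # made-up inception year

print(decide_rank(wikidata, singer))        # preferred
print(decide_rank(wikidata, tribute_band))  # unchanged
```

Returning "unchanged" instead of "deprecated" reflects the point made in this update: a lack of shared biographical data says nothing about other shared data, such as URLs.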
@Lymantria, Ymblanter: the relevant feedback should now be addressed; can you please look at this request? Thank you in advance for your time. --Hjfocs (talk) 14:28, 31 August 2021 (UTC)