Wikidata talk:Mismatch Finder

Mismatch Finder
Tackling mismatching data between Wikidata and external databases.


You are welcome to ask questions or give us your feedback here!

Geo-data related questions:

- How can I register a database for 1:1 matching?

- What is the plan for visualizing the detected extreme distances? (GPS data, P625)

context:

I am helping with the https://www.naturalearthdata.com/ - Wikidata concordances; Natural Earth is a public-domain geo-database (plus https://whosonfirst.org/).

There are a lot of data errors on both sides (as usual).

My ideal use case would be:

  • I export the "Natural_earth" locality data to a CSV hosted on any website (URL).
  • I register the join columns for the 1:1 mapping.
  • I register the matching columns, such as:
    • ne:"name" should equal wd:"English label"
      • text difference categories: exact match; unaccent match; partial match; low Jaro-Winkler distance; high Jaro-Winkler distance
    • GPS
      • distance categories: | < 5 km "Ideal" | 5-20 km "Maybe" | >= 20 km "Error" | >= 500 km "High Priority Error"
    • ne:"ISO" should equal the Wikidata country code
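
A minimal sketch of the GPS distance categories listed above, assuming both sides are plain WGS84 points in decimal degrees (the point vs point case only). The function names are illustrative and not part of any existing tool; the thresholds are the ones proposed in the list.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def distance_category(km):
    """Map a distance in km to the categories proposed above."""
    if km >= 500:
        return "High Priority Error"
    if km >= 20:
        return "Error"
    if km >= 5:
        return "Maybe"
    return "Ideal"


# One degree of longitude on the equator is roughly 111 km.
print(distance_category(haversine_km(0.0, 0.0, 0.0, 1.0)))
```

Polygon and line geometries would need a real geometry library; this sketch does not attempt them.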

The text matching is hard, so some custom data cleaning/customisation function would be useful. For example, when matching "rivers" I use a regexp to pre-clean these words: "(river|rivire|rio|le|de|saint|st.|creek|cr.|fork|fk.)"
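
A rough sketch of such a pre-cleaning step together with the first three text categories, using only the Python standard library. The word list is copied from the pattern quoted above; the function names, the case folding in the "unaccent" step and the "no match" fallback are assumptions made for illustration. Jaro-Winkler distance is left out because it needs a third-party library.

```python
import re
import unicodedata

# Words from the pattern quoted above; abbreviations may carry a trailing dot.
STOPWORDS = re.compile(
    r"\b(river|rivire|rio|le|de|saint|st|creek|cr|fork|fk)\b\.?",
    re.IGNORECASE,
)


def unaccent(text):
    """Strip combining marks, e.g. 'Río' -> 'Rio'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_name(text):
    """Lower-case, unaccent, drop the stop words and collapse whitespace."""
    text = STOPWORDS.sub(" ", unaccent(text).lower())
    return " ".join(text.split())


def text_category(a, b):
    """Classify two labels as exact, unaccent, partial or no match."""
    if a == b:
        return "exact match"
    if unaccent(a).lower() == unaccent(b).lower():
        return "unaccent match"
    ca, cb = clean_name(a), clean_name(b)
    if ca and cb and (ca in cb or cb in ca):
        return "partial match"
    return "no match"


print(text_category("Rio Grande", "Grande River"))
```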

The GPS comparison is not so easy either; you have to handle:

  • point vs point(P625)
  • polygon vs point(P625)
  • ... line vs point(P625)
  • ... line vs. multiple point - ( rivers )
  • ...

Multiple distance categories would be ideal:

  • POI distance (for museums and the like, a distance of more than 1 km is not ideal)
  • city distance
  • country distance ( as a point )


Thank you for working on this!

--ImreSamu (talk) 08:41, 21 June 2021 (UTC)

In the first versions the tool will not be as sophisticated. Initially it'll just be able to import mismatches that have been found by an external process and present them for review. So you could write a small script comparing naturalearthdata with Wikidata, generate a file containing the mismatching statements and then load that file into the Mismatch Finder. --Lydia Pintscher (WMDE) (talk) 10:37, 7 July 2021 (UTC)
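
As an illustration of what such a small script could look like, here is a hedged sketch that compares external rows with values already fetched from Wikidata and writes the differences to CSV. The column names (`item_id`, `property_id`, `wikidata_value`, `external_value`) and the input shapes are assumptions for this example only; the actual import format is whatever the tool's own documentation specifies.

```python
import csv
import io

# Hypothetical column names, chosen for this sketch only.
FIELDS = ["item_id", "property_id", "wikidata_value", "external_value"]


def find_mismatches(external_rows, wikidata_values):
    """Yield one row per statement where the two sides disagree.

    external_rows: iterable of dicts with "qid", "property" and "value".
    wikidata_values: dict mapping (qid, property) to the Wikidata value.
    Items with no Wikidata value are skipped, since nothing mismatches.
    """
    for row in external_rows:
        wd_value = wikidata_values.get((row["qid"], row["property"]))
        if wd_value is not None and wd_value != row["value"]:
            yield {
                "item_id": row["qid"],
                "property_id": row["property"],
                "wikidata_value": wd_value,
                "external_value": row["value"],
            }


def write_mismatch_csv(mismatches, out):
    """Write the mismatch rows, with a header line, to a file-like object."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(mismatches)


# Made-up sample data, just to show the call sequence.
external = [
    {"qid": "Q1", "property": "P569", "value": "1900-01-01"},
    {"qid": "Q2", "property": "P569", "value": "1950-05-05"},
]
wikidata = {("Q1", "P569"): "1900-01-01", ("Q2", "P569"): "1950-05-06"}
buffer = io.StringIO()
write_mismatch_csv(find_mismatches(external, wikidata), buffer)
print(buffer.getvalue())
```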

The need to take care of errors at Sources

Mohammed_Sadat_(WMDE) Problems we have seen

- Salgo60 (talk) 16:05, 21 June 2021 (UTC)

Yes, there will be a way to indicate that the error is in the external source. Given how different the various institutions are in accepting feedback, we'll probably have something super simple at the beginning and then see how we can expand it as people are using it. --Lydia Pintscher (WMDE) (talk) 10:43, 7 July 2021 (UTC)


WikiTree WikiTree person ID (P2949) WikiTree+ and the WikiTree Data Doctors Project

Maybe a good candidate to synch Mismatch_Finder with...

Email sent to @Mohammed_Sadat_(WMDE), Lesko987a:... Lesko987a please correct if something is wrong as I haven't been part of WikiTree since 2017....

- Salgo60 (talk) 02:40, 17 July 2021 (UTC)

  • Not sure if this is a great test case: WikiTree is a wiki, and Wikidata may not yet have found the optimal way to keep multiple diverging parent/child relations in sync. --- Jura 07:53, 19 July 2021 (UTC)
  • Hi, data validation is designed to improve data on WikiTree. I could also check things in the other direction, to inform Wikidata of the data mismatch or lack of it. I will see how you design things on the Wikidata end and whether those errors actually get corrected. But it is often very hard to resolve a data difference, since you must go back to the actual sources to decide what is correct. Most of the time WikiTree users don't correct Wikidata, since they are not users here, but lately they have been commenting on the data differences on the profiles, since invalid data on the internet tends to be copied around and tends to come back to WikiTree over and over again.

I was thinking of populating Wikidata with dates/relations from WikiTree, but I decided not to, since I didn't find an easy way to keep the data up to date. I don't like one-time data dumps, like The Peerage did last year. It caused a lot of problems on WikiTree, since they have many mistakes. Lesko987a (talk) 09:48, 19 July 2021 (UTC)

Lesko987a I think a step 1 could be that when a user has checked your suggestion, that decision, if it impacts Wikidata, should be input to Wikidata, and hopefully someone fixes this.... I guess we will see a lot of lessons learned BUT if a skilled person says this is wrong in Wikidata then that should be a high priority to fix....
  • the problem I have seen with Wikidata <-> WikiTree diffs is that they are mostly on profiles in areas I have no skills in --> I can't tell if it is correct or not... we have a lot to learn
  • agree about "The Peerage dump" related problems; we should not add more frustration but instead create an interaction that builds trust....
- Salgo60 (talk) 12:49, 19 July 2021 (UTC)
  • If the reverse of the report could easily be generated and regularly refreshed, that could be interesting. Personally, I would be ok with getting regular imports of dates and places from WikiTree. Either they would be new for Wikidata or get appropriate ranks. It's a bit more complicated with family relationships, but maybe in the meantime someone has figured out a reasonable way to keep multiple possibilities consistent. --- Jura 10:40, 19 July 2021 (UTC)

Support in the API for external reviewers' decisions

- Salgo60 (talk) 10:03, 19 July 2021 (UTC)

Bad link

I think your placeholder link is meant to be: https://mismatch-finder.toolforge.org/ - Fuzheado (talk) 20:03, 21 June 2021 (UTC)

Ah, yes. Thanks! -Mohammed Sadat (WMDE) (talk) 07:16, 22 June 2021 (UTC)

A few questions

Would the mismatches store have an API that I (or someone else) could submit potential mismatches to? Would the mismatches system automatically determine which values failed to match between Wikidata vs an external database? e.g. would I write a script to query Wikidata and an external database (let's say IGDB) to find any video games that have a different release date on one vs the other, or would I write a script to dump every video game on IGDB with its release date and then the mismatches system finds mismatches itself?

My other question: I run vglist.co, which is a website that pulls a lot of data for video games from Wikidata. Would you expect that I could implement a "Report Issue" feature on my site where the user would be able to report a data problem that'd forward it to this mismatches system? e.g. a user sees that the release date for Super Mario Bros is wrong, reports that, and then my site would forward that report into the mismatches system (although I'm not sure how useful that'd be, since the report would only have the release date from vglist, which would be identical to the one in Wikidata)?

Thanks for all your work as usual, this tool looks like it could be very useful :) Nicereddy (talk) 04:47, 30 June 2021 (UTC)

For your first question: you'll have a way to upload a CSV with the mismatches. I'm not yet sure if we'll be able to have that in the first version or if we'll go with opening a ticket in Phabricator and then we upload it initially. Either way there will be a way to do that in the future.
For your second question: that's an interesting use case. I think it should be possible with a few hacks. We could for example abuse the mismatching value and make it say "user reported a mistake on vglist.co" or so. Would be cool! --Lydia Pintscher (WMDE) (talk) 10:49, 7 July 2021 (UTC)

Feedback on File:Table Mismatch Finder High Fidelity Mockup.jpg

Interesting approach. Looks promising. Just a few points:

  • Formatting tweaks:
    • If "184746" is the QID (e.g. Q184746), the "Q" should be included.
    • Property labels should match those on Wikidata ("date of birth", not "Date of birth", see Property:P569)
    • It's unclear where the dates in the sample would link to
  • Missing info:
    • The key used to match Wikidata and the external reference should be included
    • provide a link to the external resource (based on a formatter URL, see phab:T285851#7220098)
    • If mismatches based on different catalogues are presented, the catalogue should be identified too
    • Note that we could have multiple catalogues with "VIAF ID" as key and these aren't necessarily with data from VIAF
  • Status:
    • About various options for "status", see Wikidata:Project_chat#Fun_with_Mismatches:_typology.
    • clickable icons to select some status might help: e.g.
      • next to values from Wikidata: incorrect, preferred, conflation
      • next to values from External source: incorrect, preferred, conflation
      • between the two: both are equally correct
      • next to key: key mismatch
  • For "upload" to work:
    • each catalogue should have information associated with it to populate references (see phab:T285851#7220098).
    • a way to map values of the external source to Wikidata items should be provided (e.g. every value "male" from a catalogue → Q6581097). The screen should have a link to go there.
  • There should probably be also a screen to view mismatches from a given catalogue only
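
To make two of the points above concrete (the "Q" prefix and the value-to-item mapping), here is a small hedged sketch. Only the "male" → Q6581097 pair comes from the list above; the function names and the idea of a per-catalogue dictionary are assumptions for illustration.

```python
# Per-catalogue mapping of external values to Wikidata items.
# Only the "male" entry is taken from the discussion above.
VALUE_TO_ITEM = {"male": "Q6581097"}


def normalize_qid(raw):
    """Return an item ID with its "Q" prefix, e.g. "184746" -> "Q184746"."""
    digits = str(raw).strip().upper()
    if digits.startswith("Q"):
        digits = digits[1:]
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"not an item ID: {raw!r}")
    return "Q" + digits


def map_external_value(value, mapping=VALUE_TO_ITEM):
    """Look up the item for an external value; None means unmapped."""
    return mapping.get(value.strip().lower())


print(normalize_qid("184746"), map_external_value("Male"))
```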

@Mattia Capozzi (WMDE):

--- Jura 07:05, 19 July 2021 (UTC)

Detection of circular mismatches and historical edits with detected mismatch.

I'm looking forward to Mismatch Finder being available for further testing. One question I have is how Mismatch Finder will help in monitoring for circular or past conflicting changes, where two or more external databases have conflicting values for the same Wikidata Q and each changes the Wikidata value but not its own. Since not all changes are made through Mismatch Finder, will there be any analysis of a Q's change history (including merged item history), with checks for past values and sources that conflict? It is important to bring those to the attention of users so that a more complex situation can be highlighted for further investigation, and to prevent circular changes caused by system-assisted editing. Wolfgang8741 (talk) 14:33, 29 July 2021 (UTC)
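
One simple way to picture the check asked about here: given a statement's value history in chronological order, flag any value that was replaced and later reinstated. This is only a sketch under that assumption; the input shape (a list of `(value, source)` pairs) and the function name are hypothetical, and it says nothing about how the history would be obtained.

```python
def find_reinstated_values(history):
    """Return values that were replaced and later put back.

    history: list of (value, source) pairs, oldest first.
    Consecutive repeats of the same value count as a single state.
    """
    seen = set()
    flagged = []
    previous = None
    for value, _source in history:
        if value == previous:
            continue  # same state as before, not a change
        if value in seen and value not in flagged:
            flagged.append(value)  # the value came back after being replaced
        seen.add(value)
        previous = value
    return flagged


# Made-up history: two sources keep overwriting each other's value.
edits = [("1900", "database A"), ("1901", "database B"), ("1900", "database A")]
print(find_reinstated_values(edits))
```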

At least initially the tool will unfortunately not be able to deal very well with this. But I will note it down as something to figure out. Thanks! -- Lydia Pintscher (WMDE) (talk) 16:13, 18 October 2021 (UTC)

Mismatch from Wikipedia

Hi. Firstly, thank you for working on this. I have no idea how I missed this project for so long in the weekly updates. I see the latest update on the site is from June; maybe it's time for an overhaul of the pages? :)

Now, down to business: I see in the sections above that I can import mismatches from a script. As a matter of fact, I have such a script identifying dob/dod differences between Wikidata and RO.wp. My questions are:

  1. Are Wikipedia articles a valid datasource for this project?
  2. Can I already import my data somewhere? If yes, could you point me to some docs?

Thank you. Strainu (talk) 18:46, 16 October 2021 (UTC)

Yeah the page needs an overhaul by now :D
I would say Wikipedias can definitely be considered, yes.
Right now you unfortunately cannot upload your data because we are not quite ready for it yet. But the instructions are already here: https://github.com/wmde/wikidata-mismatch-finder/blob/main/docs/UserGuide.md -- Lydia Pintscher (WMDE) (talk) 16:15, 18 October 2021 (UTC)

Can we start using the tool?

I have some issues between Wikidata and Nobelprize.org with data about Nobel Prize winners that I would like to track with this tool as a test. Please let me know how to do that.

- Salgo60 (talk) 23:10, 4 February 2022 (UTC)

We are ironing out a few remaining issues but should be ready to go in the next few days. Lydia Pintscher (WMDE) (talk) 10:57, 5 February 2022 (UTC)

Paraphrasing concerns from Tagishsimon

Some interesting points are made in this Twitter thread, by User:Tagishsimon; to paraphrase:

  • where can we see which catalogues Mismatch Finder checks against?
  • [bug] - "if I throw 275 items at it, I get a 431 error. If I throw 500 items at it ... does nothing at all."
  • no integration with tools [like] Petscan, QuickStatements, Mix'n'Match, WDQS, or pagepile, etc.
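
HTTP 431 means "Request Header Fields Too Large", so one possible workaround for the bug quoted above is to submit items in smaller batches. This is a sketch under the assumption that the failure is size-related; the batch size of 100 is an arbitrary illustrative value, not a documented limit of the tool.

```python
def batches(item_ids, size=100):
    """Split a list of item IDs into consecutive chunks of at most `size`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(item_ids), size):
        yield item_ids[start:start + size]


# 275 made-up item IDs, matching the count in the bug report.
ids = [f"Q{n}" for n in range(1, 276)]
print([len(chunk) for chunk in batches(ids)])
```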

But do read the whole thread. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 12:11, 23 February 2022 (UTC)

Thanks Andy,
There is definitely still a lot that can be improved about the Mismatch Finder. We wanted to get this out now to hear where additional work makes sense and is most needed. So far I've definitely heard about discoverability of the data. I'm thinking of potentially solving this with a button to get random mismatches and/or a clearer list of the external sources that are currently in the store. You can see the most recent uploads of mismatches here: https://mismatch-finder.toolforge.org/store/imports but that can certainly be improved more.
As for integration in other tools: I am not yet sure what that would look like, but I'd love to hear if someone has concrete ideas/requests.
As for it doing nothing: that shouldn't happen and we'll look into it.
In general the tool has a bit of a cold-start problem because initially we don't have a ton of mismatches uploaded into the tool. We hope that will change over the next days and weeks. (We're already in touch with a few people who said they'd be willing and are able to provide us with mismatches for the Mismatch Finder.)
Cheers Lydia Pintscher (WMDE) (talk) 15:49, 23 February 2022 (UTC)
Does this work now?
  • I got an email from alan.ang@wikimedia.de, Partner Manager (Wikidata) "in case you are having mismatch issues between your data and that of Wikidata’s, you may wish to check out the Mismatch Finder tool (see attached). With Mismatch Finder, you will be able to inform Wikidata editors of the mismatches between your data and that of Wikidata’s. Editors will then be able to reconcile these mismatches that eventually improve the quality of the data in your projects."
  • But https://mismatch-finder.toolforge.org/random says "There are currently no mismatches available for review"
  • Can I submit mismatches? Is it at https://mismatch-finder.toolforge.org/store/imports?
  • A lot of the CSVs at https://mismatch-finder.toolforge.org/store/imports are red
To sum it up, the tool still appears to be problematic. (cc @Pigsonthewing @Salgo60)
Cheers! Vladimir Alexiev (talk) 20:59, 26 January 2023 (UTC)
Sorry for only getting back to you now.
Yes Alan reached out to various people who might be in a position to provide mismatches because they for example have internal quality assurance processes for Wikidata's data.
When you looked at it, all previously uploaded mismatches had expired. We have new uploads now and they are going to be monthly. We are working on getting more.
The imports page was mostly red because there was an upload that wasn't yet adapted to the new upload CSV format. It has now been adjusted as well.
Yes, you can upload mismatches. I'll need to add your account to the allow list first. (We put that in place for the beginning to have a bit more control over what goes into the system initially.) Would you like me to? Alternatively you can also send the CSV to me and I will handle the upload. I'll also document a bit better that this is necessary. Lydia Pintscher (WMDE) (talk) 15:30, 9 February 2023 (UTC)

What imports are set up already?

Is there a list of imports that are done one-off/on a continuous basis?

For example, I am sitting on a huge amount of potential mismatches, has anyone imported those? Magnus Manske (talk) 09:37, 6 December 2023 (UTC)

Hi @Magnus Manske,
You can see the latest uploads at https://mismatch-finder.toolforge.org/store/imports. As you can see, so far only Mike Peel has set up regular uploads, which cover mismatches between English Wikipedia and Wikidata. I think your additional mismatches would be very useful. Are you interested in uploading them yourself? Alternatively, we have a student team starting work on getting more mismatches early next year, and getting your data into the right format and uploading it might be a good first task for them. Lydia Pintscher (WMDE) (talk) 16:56, 8 December 2023 (UTC)
I made some views in the mix'n'match DB, and exposed them as JSON:
  • duplicate_items is a list of potentially duplicate items
  • mismatched_items is a list of, well, more potentially duplicate items
  • time_mismatch is a list of items that have different time values (usually birth/death) with a source

There is one more (multiple values for an external ID property on WD) but that can be found better with SPARQL these days... Let me know if this works for you, or if you prefer the toolforge DB views instead. --Magnus Manske (talk) 11:18, 17 January 2024 (UTC)

@Magnus Manske Thank you! We'll have a look. Lydia Pintscher (WMDE) (talk) 11:32, 17 January 2024 (UTC)