Wikidata:Mismatch Finder/Collaboration/Purdue Summer of Data 2024

Wikidata, with over 2 billion edits, has become the most edited wiki globally. With its rapid growth have also come errors that are hard to keep track of. Discrepancies between Wikidata and trusted external sources have been identified, and these could propagate to downstream projects, including applications, national records, search engines, and other Wikimedia projects such as Wikipedia.

To tackle this, the Mismatch Finder was established as a space where downstream projects and the broader Wikimedia community can report disparities between Wikidata and other trusted data sources. Reported mismatches are then visible on Wikidata, so editors can verify them and correct the data on Wikidata or in the external source.

This project’s goal is to deliver new, useful mismatches for the Mismatch Finder.

The Data Mine Student Project

Data Science students at The Data Mine will work as a team to identify and address differences between Wikidata and external data sources. All work will be open source and released under open licenses. The project is supported by Wikimedia Deutschland.

Participants

Timeline and updates

Onboarding: January 8–21, 2024
  • Participant introductions
  • Setup and learning the tech
  • Exploring the data sources
Weekly Sprints: January 22 – April 7, 2024
  • Deep dives into selected data sources of interest
  • Generating mismatches and modeling them against Wikidata’s data
  • Sending generated mismatches to the community for their feedback
Closing: April 8–26, 2024
  • Documenting the mismatches found and the processes used to derive them
  • Writing a tutorial on how to find mismatches using what we’ve learned
  • Final project roundup
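
As a rough illustration of the "generating mismatches" step in the weekly sprints, the following minimal Python sketch compares values from an external source with the values recorded on Wikidata for the same items. Everything in it is an assumption made for illustration: the function name find_mismatches, the record field names, and the sample item IDs and dates are hypothetical and do not reflect the students' actual pipeline or the Mismatch Finder's import format.

```python
from typing import Dict, List


def find_mismatches(
    wikidata: Dict[str, str],
    external: Dict[str, str],
    property_id: str,
) -> List[dict]:
    """Return one record per item whose value differs between the two sources.

    Both inputs map an item ID to a single string value for `property_id`.
    The record field names are hypothetical, chosen for this sketch only.
    """
    mismatches = []
    # Only items present in both sources can be compared.
    for item_id in sorted(wikidata.keys() & external.keys()):
        wd_value = wikidata[item_id].strip()
        ext_value = external[item_id].strip()
        if wd_value != ext_value:
            mismatches.append(
                {
                    "item_id": item_id,
                    "property_id": property_id,
                    "wikidata_value": wd_value,
                    "external_value": ext_value,
                }
            )
    return mismatches


# Made-up sample data: placeholder item IDs with dates of birth (P569).
wikidata_sample = {
    "Q100001": "1950-01-01",
    "Q100002": "1960-05-17",
    "Q100003": "1971-11-30",
}
external_sample = {
    "Q100001": "1950-01-01",
    "Q100002": "1960-05-18",
    "Q100004": "1982-02-02",
}

result = find_mismatches(wikidata_sample, external_sample, "P569")
for record in result:
    print(record)
```

In this sketch, items that appear in only one source are skipped rather than reported, since a missing value is not necessarily a mismatch. A real comparison would likely also need to normalize formats (for example date precision or units) before comparing values.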

Please help us to find the right focus!

We invite the Wikidata community to actively participate in identifying data sources that, when compared with Wikidata, could generate numerous and significant mismatches. Your insights will guide our focus and contribute to the success of the Mismatch Finder project.

We are looking for datasets that are free to use, easily accessible, and ideally relevant to data that is used on Wikipedia. These are potential data sources that we could work on (based on T304448):

Please let us know which of the datasets you would be most interested in, and what types of discrepancies we should focus on. Suggestions for other data sources are welcome!

The Outcomes

Stay tuned for updates on the outcomes of the Purdue Summer of Data 2024: Mismatch Finder project. We will provide comprehensive details once the project concludes.