Wikidata:Automated finding references input
Many statements on Wikidata have no reference, or no suitable one: about 33% (source). For some of these statements, it is possible to find references automatically by comparing our data with data on other websites that contain structured data (marked up with schema.org markup). We are building a tool that looks at an Item and checks every site linked via an external ID for such structured data. If a linked site does contain structured data, we compare it to the unreferenced statements on the Item and see whether any of them match. If they match, we have found a potential new reference for a statement.
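The matching step described above can be illustrated with a minimal sketch. This is not the tool's actual implementation: it assumes the external page embeds its schema.org data as JSON-LD, that values can be compared as plain strings, and that a small hand-written mapping from Wikidata properties to schema.org keys is available. The names `JsonLdExtractor`, `PROPERTY_TO_SCHEMA` and `find_candidate_references` are hypothetical and chosen only for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects parsed <script type="application/ld+json"> blocks from a page."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed markup instead of failing the whole page


# Illustrative mapping only (an assumption, not the tool's real configuration):
# P569 = date of birth, P570 = date of death.
PROPERTY_TO_SCHEMA = {"P569": "birthDate", "P570": "deathDate"}


def find_candidate_references(unreferenced, html, url):
    """Return one candidate reference per unreferenced statement that the page confirms.

    `unreferenced` maps a property ID to the statement's value as a string;
    `html` is the content of a site linked from the Item via an external ID.
    """
    parser = JsonLdExtractor()
    parser.feed(html)
    candidates = []
    for doc in parser.documents:
        if not isinstance(doc, dict):
            continue  # this sketch ignores top-level JSON-LD arrays
        for prop, value in unreferenced.items():
            key = PROPERTY_TO_SCHEMA.get(prop)
            if key is not None and doc.get(key) == value:
                candidates.append(
                    {"property": prop, "value": value, "reference_url": url}
                )
    return candidates
```

A real comparison would likely need to normalize values first (date precision, units, labels versus Item IDs) and to handle other schema.org encodings such as Microdata or RDFa; exact string equality is used here only to keep the sketch short.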
We are going to release a number of the references we found in order to get your feedback on them and to guide how we continue fine-tuning the system. The references are available on this wiki page. This is only the first step; more will come in the next few weeks.
In the long run, our goal is to keep adding trustworthy sources to the list, to use them to suggest references to editors, and to check existing references to ensure that they still say what we claim they say. As Wikidata grows, we believe it is crucial to have more automated ways of supporting you in ensuring the integrity of Wikidata’s data. This is hopefully a good step in that direction.
On this page, you will find the current status of the project. You’re welcome to give feedback on the talk page.
See also: Phabricator board
April 2020: first batch
We are now analyzing 400 different websites to compare their structured data to ours and find meaningful references. Feel free to look at the first references, and if you encounter an issue or find something weird, please let us know on the talk page.
May 2020: new data and a distributed game
We improved our data batch based on your feedback. On top of that, we created a distributed Wikidata game called Reference hunt!. With this game, you get a suggestion of an Item and a reference based on structured data from an external website.
October 2020: data dump and dashboard
We added 529K more potential references to the batch. Here is a dump containing the raw references; some of them may contain errors or be irrelevant. A dashboard has been created to help you filter and reuse the data.