Wikidata:Mismatch Finder/Contribute mismatches
Mismatch Finder
Tackling mismatching data between Wikidata and external databases.
Here is the process users should take to contribute mismatches to the Mismatch Store.
Choosing a database
- Verify that Wikidata has external identifier properties linking Wikidata items to the database
- For example, if our chosen database is MusicBrainz, we would search for “P:MusicBrainz” and find properties like “MusicBrainz release group ID” (P436), which links a Wikidata album to a MusicBrainz release group
- Verify that many Wikidata items are linked to that database, so the code will be useful for many entries (run a query on query.wikidata.org counting how many items have the external ID property found above)
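With the MusicBrainz example, a count query of this shape can be run on query.wikidata.org (the variable names are illustrative):

```sparql
# Count how many Wikidata items have a MusicBrainz release group ID (P436)
SELECT (COUNT(DISTINCT ?item) AS ?count)
WHERE {
  ?item wdt:P436 ?id.
}
```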
Choosing an attribute
- Decide which type of data to compare and find mismatches for (e.g. artist data, book data, movie data)
- Find which attribute of the data type to compare
- For example, MusicBrainz stores the founding date of an artist, and Wikidata does as well using the “debut date” property (P10673)
Getting the data
- Find a way to access the database's data so we can compare it with Wikidata's
- This can be either through an API, web scraping or a data dump
- Make sure the API is publicly available (not paid). If the API is “unofficial”, verify that it has accurate and up-to-date information.
- For web scraping, look into a Python library like Beautiful Soup
- Get the data from the database using the method chosen above
- You can test retrieving a single item from the database first to learn how to retrieve many
- Write Python code to get that data, using the requests library for URLs or another method
- Write the code to extract the chosen attribute to compare with Wikidata. For example, the object returned from the API might be a Python dict or list, so use dict["key"] to reach the value.
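As a sketch of that last step, assume the API returned JSON that was parsed into a nested Python dict (the field names below imitate MusicBrainz's artist lookup and are only illustrative):

```python
# A truncated, hypothetical example of a parsed JSON API response;
# it only illustrates drilling into nested dicts and lists.
api_response = {
    "name": "Example Artist",
    "life-span": {"begin": "1960-04-26"},
    "aliases": [{"name": "The Example"}],
}

# Plain dict["key"] access reaches the attribute to compare with Wikidata.
begin_date = api_response["life-span"]["begin"]

# dict.get() avoids a KeyError for records that lack the attribute.
safe_begin = api_response.get("life-span", {}).get("begin")

# List entries are reached by index before further key access.
first_alias = api_response["aliases"][0]["name"]

print(begin_date, first_alias)  # 1960-04-26 The Example
```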
Getting the Wikidata items
- There are two approaches:
- Write a SPARQL query
- Download a Wikidata dump
- Find which items have both:
- The external identifier property researched earlier that links the Wikidata item to the database
- The attribute property decided on
- Execute a SPARQL query on QLever to do this, because it lets you download the results quickly as a CSV
SELECT DISTINCT ?rank ?join_col ?join_col_guid
WHERE {
  ?join_col_guid ps:P569 ?join_col.
  wd:Q23 p:P569 ?join_col_guid.
  ?join_col_guid wikibase:rank ?rank.
  FILTER(?rank = wikibase:PreferredRank || ?rank = wikibase:NormalRank)
}
- With the MusicBrainz example, we would filter for items that have “MusicBrainz artist ID” (P434) and “debut date” (P10673)
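A query along these lines (a sketch, not the only way to write it) would select those items together with the statement GUID and rank used in the example query above:

```sparql
# Items with a MusicBrainz artist ID (P434) and a "debut date" (P10673)
SELECT DISTINCT ?item ?external_id ?join_col ?join_col_guid ?rank
WHERE {
  ?item wdt:P434 ?external_id.
  ?item p:P10673 ?join_col_guid.
  ?join_col_guid ps:P10673 ?join_col.
  ?join_col_guid wikibase:rank ?rank.
  FILTER(?rank = wikibase:PreferredRank || ?rank = wikibase:NormalRank)
}
```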
Comparing the items
- As mentioned above, there are several ways to get the data from the selected external data source:
- REST APIs
- Web scraping
- Data dumps
- Once the data for a specific property is retrieved, write Python code to compare the Wikidata item and the external data source item
- Any mismatches found need to be written to a CSV file
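A minimal comparison sketch, assuming both datasets have already been loaded into Python and keyed by the shared external identifier (all names and values here are illustrative):

```python
# Hypothetical rows from Wikidata (e.g. the downloaded SPARQL CSV) and the
# external database, keyed by the shared external identifier.
wikidata_rows = {
    "mbid-1": {"item_id": "Q111", "guid": "Q111$abc", "value": "1960-04-26"},
    "mbid-2": {"item_id": "Q222", "guid": "Q222$def", "value": "1975-01-01"},
}
external_rows = {
    "mbid-1": "1960-04-26",   # same value -> no mismatch
    "mbid-2": "1975-12-31",   # differs    -> mismatch
}

def find_mismatches(wikidata_rows, external_rows):
    """Return one record per value that differs between the two sources."""
    mismatches = []
    for ext_id, row in wikidata_rows.items():
        external_value = external_rows.get(ext_id)
        if external_value is not None and external_value != row["value"]:
            mismatches.append({
                "item_id": row["item_id"],
                "statement_guid": row["guid"],
                "wikidata_value": row["value"],
                "external_value": external_value,
            })
    return mismatches

mismatches = find_mismatches(wikidata_rows, external_rows)
print(mismatches)  # only the mbid-2 record differs
```

In practice the comparison often needs normalization first (e.g. trimming whitespace or reconciling date precisions) before a plain inequality check is meaningful.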
Getting the mismatches formatted as a CSV file
- A CSV import file must include the following header row to describe each column: item_id,statement_guid,property_id,wikidata_value,meta_wikidata_value,external_value,external_url,type
- item_id - The item ID of the Wikidata item containing the mismatching statement.
- statement_guid - (Optional) The unique ID of the statement on Wikidata that contains the mismatching data. If present, it must be consistent with the item_id. Can be empty to signify that no matching value was found on Wikidata, in which case wikidata_value must also be empty.
- property_id - The ID of the Wikidata property defining the Wikidata value of the mismatch.
- wikidata_value - (Optional) The value on Wikidata that mismatches an external database. Can be empty (see statement_guid).
- meta_wikidata_value - (Optional) The value on Wikidata that represents the property's calendar/time type.
- external_value - The value in the external database that mismatches a Wikidata value.
- external_url - (Optional) A URL or URI to the mismatching entity in the external database.
- type - (Optional) A string containing either 'statement' or 'qualifier' to indicate where the mismatch occurs. If left empty, 'statement' is assumed.
- Note: wikidata_value, external_value and external_url should each be limited to a maximum length of 1500 characters.
- All columns must be present; optional values can be left empty, e.g. ,, for empty meta_wikidata_value.
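The mismatch records can be written with Python's csv module; this sketch writes the required header and one illustrative row (the GUID and values are made up), leaving the optional columns empty:

```python
import csv

# The header row required by the Mismatch Finder import format.
FIELDS = ["item_id", "statement_guid", "property_id", "wikidata_value",
          "meta_wikidata_value", "external_value", "external_url", "type"]

# One illustrative mismatch; keys omitted here become empty optional columns.
row = {
    "item_id": "Q222",
    "statement_guid": "Q222$dummy-guid",   # hypothetical GUID
    "property_id": "P10673",
    "wikidata_value": "1975-01-01",
    "external_value": "1975-12-31",
}

with open("mismatches.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerow(row)   # missing optional fields are written as empty
```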
Uploading the mismatches onto the Mismatch Finder
- Before uploading all of the mismatches, it is a good idea to have them peer-reviewed to ensure that they are valid
- Open a ticket with your prepared mismatch file, and the Wikidata team will upload your mismatches for you