Wikidata:Mismatch Finder/Contribute mismatches

Mismatch Finder
Tackling mismatching data between Wikidata and external databases.



Here is the process users should follow to contribute mismatches to the Mismatch Store.

Choosing a database

  1. Verify that external identifier properties linking Wikidata items to the database exist on Wikidata
  2. For example, if our chosen database is MusicBrainz, we would search for “P:MusicBrainz” and find properties like “MusicBrainz release group ID” (P436), which links a Wikidata album to a MusicBrainz release group
  3. Verify that many Wikidata items are linked to that database, since we want our code to be useful for many entries (run a query on query.wikidata.org counting how many items have the external identifier property found)
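The coverage check in step 3 can be sketched as a SPARQL count query on query.wikidata.org, shown here for the P436 example above:

```sparql
# Count how many items carry the external identifier property
# (MusicBrainz release group ID, P436, as an example)
SELECT (COUNT(DISTINCT ?item) AS ?count)
WHERE {
  ?item wdt:P436 ?externalId.
}
```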

Choosing an attribute

  1. Decide which type of data to compare and find mismatches for (e.g. artist data, book data, movie data)
  2. Find which attribute of that data type to compare
  3. For example, MusicBrainz stores the founding date of an artist, and Wikidata stores a comparable value using the “debut date” property (P10673)

Getting the data

  1. Find a way to access the database's data so it can be compared with Wikidata's
    1. This can be through an API, web scraping, or a data dump
    2. Make sure the API is publicly available (not paid). If the API is “unofficial”, verify that it has accurate and up-to-date information.
    3. For web scraping, look into a Python library such as Beautiful Soup
  2. Get the data from the database using the method chosen above
    1. Test fetching a single item from the database first to learn how to fetch many
    2. Write Python code to get that data, e.g. using the requests library for URL-based APIs
  3. Write the code that extracts the chosen attribute to compare with Wikidata. For example, the object returned from the API might be a Python dict or list, so use dict["key"] to reach the value.
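As a minimal sketch of step 3: assuming the API's JSON has already been decoded into a Python dict (the "life-span" layout below mirrors MusicBrainz's artist lookup, but treat it as an assumption about the response shape), the nested value can be pulled out like this:

```python
# Pull one attribute out of a decoded API response dict.
# The "life-span"/"begin" keys are an assumption about the response shape.
def extract_begin_date(artist: dict):
    # .get() with a default avoids KeyError when a key is missing.
    return artist.get("life-span", {}).get("begin")

# Example response fragment (made-up data):
sample = {"name": "Example Artist", "life-span": {"begin": "1957-03-10"}}
print(extract_begin_date(sample))  # 1957-03-10
```

Using `.get()` instead of `dict["key"]` keeps the code from crashing on entries where the external database has no value for the attribute.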

Getting the Wikidata items

  1. There are two approaches:
    1. Write a SPARQL query
    2. Download a Wikidata dump
  2. Find which items have both:
    1. the external identifier property researched earlier that links the Wikidata item to the database, and
    2. the attribute property decided on
  3. Executing the SPARQL query on QLever is convenient because it lets you download the results quickly as a CSV:
SELECT DISTINCT ?rank ?join_col ?join_col_guid
WHERE {
  wd:Q23 p:P569 ?join_col_guid.
  ?join_col_guid ps:P569 ?join_col.
  ?join_col_guid wikibase:rank ?rank.
  FILTER(?rank = wikibase:PreferredRank || ?rank = wikibase:NormalRank)
}
  4. With the MusicBrainz example, we would be filtering for items that have “MusicBrainz artist ID” (P434) and “debut date” (P10673)
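The CSV downloaded from QLever can then be read back into Python with the standard csv module. The column names correspond to the query's SELECT variables; the sample row below is made-up data:

```python
import csv
import io

# Load query results exported as CSV into a list of dicts,
# one dict per row, keyed by the header row's column names.
def load_results(csv_text: str) -> list:
    return list(csv.DictReader(io.StringIO(csv_text)))

# Sample CSV text (made-up values); in practice, read the downloaded file.
sample_csv = (
    "rank,join_col,join_col_guid\n"
    "http://wikiba.se/ontology#NormalRank,1732-02-22,Q23$example-guid\n"
)
rows = load_results(sample_csv)
print(rows[0]["join_col"])  # 1732-02-22
```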

Comparing the items

  1. As mentioned above, there are several ways to get the data from the selected external data source:
    1. REST APIs
    2. Web scraping
    3. Data dumps
  2. Once the data for a specific property is found, write Python code to compare the Wikidata item with the external data source's item
    1. Any mismatches found need to be written to a CSV file
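A minimal comparison sketch for date values follows. The normalization step is an assumption: Wikidata timestamps often look like "+1957-03-10T00:00:00Z", while real data may additionally need calendar and precision handling.

```python
def normalize_date(value: str) -> str:
    # Trim a Wikidata-style timestamp ("+1957-03-10T00:00:00Z") down to
    # a plain YYYY-MM-DD date. Assumes both sources use day precision.
    return value.lstrip("+").split("T")[0]

def find_mismatch(wikidata_value: str, external_value: str):
    # Return the differing pair when the normalized values disagree,
    # or None when they match.
    if normalize_date(wikidata_value) != normalize_date(external_value):
        return (wikidata_value, external_value)
    return None

print(find_mismatch("+1957-03-10T00:00:00Z", "1957-03-10"))  # None
print(find_mismatch("+1957-03-10T00:00:00Z", "1958-01-01"))
```

Comparing normalized values rather than raw strings avoids reporting spurious mismatches that are only formatting differences.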

Getting the mismatches formatted as a CSV file

  1. A CSV import file must include the following header row, to describe each column: item_id,statement_guid,property_id,wikidata_value,meta_wikidata_value,external_value,external_url,type
  2. item_id - The item ID of the Wikidata item containing the mismatching statement.
  3. statement_guid - (Optional) The unique ID of the statement on Wikidata that contains the mismatching data. If present, it must be consistent with the item_id. Can be empty to signify that no matching value was found on Wikidata, in which case wikidata_value must also be empty.
  4. property_id - The ID of the Wikidata property defining the Wikidata value of the mismatch.
  5. wikidata_value - (Optional) The value on Wikidata that mismatches an external database. Can be empty (see statement_guid).
  6. meta_wikidata_value - (Optional) The value on Wikidata that represents the property's calendar/time type.
  7. external_value - The value in the external database that mismatches a Wikidata value.
  8. external_url - (Optional) A URL or URI to the mismatching entity in the external database.
  9. type - (Optional) A string containing either the value 'statement' or the value 'qualifier' to indicate where the mismatch occurs. If left empty, a value of 'statement' will be assumed.
  10. Note: wikidata_value, external_value, and external_url should each be limited to a maximum length of 1500 characters.
  11. All columns must be present; optional values can be left empty, e.g. ,, for an empty meta_wikidata_value.
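A sketch of producing an import file with the required header row, using Python's csv module (the row values are made-up placeholders):

```python
import csv

# The exact header row required by the Mismatch Finder import format.
FIELDNAMES = [
    "item_id", "statement_guid", "property_id", "wikidata_value",
    "meta_wikidata_value", "external_value", "external_url", "type",
]

# Write the header plus one example mismatch row (placeholder values).
with open("mismatches.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerow({
        "item_id": "Q100",
        "statement_guid": "Q100$example-guid",  # hypothetical GUID
        "property_id": "P569",
        "wikidata_value": "1732-02-22",
        "meta_wikidata_value": "",              # optional, left empty
        "external_value": "1733-02-22",
        "external_url": "https://example.org/person/1",
        "type": "statement",
    })
```

Using csv.DictWriter with a fixed fieldnames list guarantees every column is present in the right order, and missing optional values simply come out as empty fields.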

Uploading the mismatches onto the Mismatch Finder

  1. Before uploading all of the mismatches, it is a good idea to have them peer-reviewed to ensure that the mismatches are valid
  2. Open a ticket with your prepared mismatch file, and the Wikidata team will upload your mismatches for you