Help:Deduplication

Deduplication is the process of identifying and removing duplicates in Wikidata.

Duplicates are items that cover the same concept (or person) as other items.

Reasons

Duplicates can occur for several reasons:

  • lack of de-duplication before new items are created.
Sample: no check for existing items is done before item creation.
  • lack of information about a new concept.
Sample: a new item for a person with ID #123 should be created, but no other information is available to match it with existing items.
  • lack of information in existing items.
Sample: the item for "deduplication" has a label only in Chinese and a sitelink only to zhwiki, so a non-Chinese speaker does not find it and creates an additional item.
  • inaccurate information in existing items or for new items.
Sample: data for a John Smith (born 1820, died 1890) should be added, but the Wikidata item about the same person gives 1819 as the year of birth and 1891 as the year of death.
  • incorrect or problematic structure of existing elements.
Sample: an item is labelled "Galactic President Theodore Evelyn Mosby" and a check for "Ted Mosby" doesn't find it.

When to deduplicate

  • prior to creation of new items
  • after creation of new items

Approaches

De-duplicate by identifier

For properties with a distinct value constraint, KrBot generates a report. These reports are linked on the talk page of each property (under "Database reports/Constraint violations"). They contain information about all constraint violations, not just violations of the distinct value constraint. Some widely used properties have grown too large for the bot to handle.
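The idea behind a distinct value check can be sketched in a few lines of Python. This is a minimal illustration and not KrBot's implementation: the function name `find_shared_identifiers`, the placeholder item IDs and the identifier values are hypothetical, and the input is assumed to be a plain list of (item, value) pairs already extracted for one property.

```python
from collections import defaultdict


def find_shared_identifiers(pairs):
    """Return {identifier value: sorted item IDs} for values used by more than one item.

    `pairs` is an iterable of (item_id, identifier_value) tuples for a single
    property; this input shape is an assumption made for the sketch.
    """
    items_by_value = defaultdict(set)
    for item_id, value in pairs:
        # Trim surrounding whitespace so "123 " and "123" count as one value.
        items_by_value[value.strip()].add(item_id)
    return {
        value: sorted(items)
        for value, items in items_by_value.items()
        if len(items) > 1
    }


# Hypothetical sample data: two placeholder items carry the same identifier.
sample = [("QA", "123"), ("QB", "456"), ("QC", "123 ")]
duplicates = find_shared_identifiers(sample)
print(duplicates)
```

Each key in the result is an identifier value that appears on several items. Those items are candidates for a merge, not confirmed duplicates, because the identifier itself may have been added in error.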

Tools to match identifiers to existing items:

De-duplicate by other elements

  • Labels
  • Page titles
  • date of birth (P569) and date of death (P570)
  • date of birth (P569)
      • Query sample: ..
  • date of death (P570)
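Matching on dates can be sketched as follows. This is a rough illustration under stated assumptions, not a query against Wikidata: the record layout (`id`, `born`, `died`), the function names and the one-year tolerance are all hypothetical, and only years are compared, whereas date of birth (P569) and date of death (P570) values also carry a precision. The tolerance is there because, as in the John Smith example, the same person may be recorded with slightly different years.

```python
from itertools import combinations


def years_close(a, b, tolerance):
    """True when both years are known and differ by at most `tolerance`."""
    return a is not None and b is not None and abs(a - b) <= tolerance


def candidate_pairs(records, tolerance=1):
    """Return ID pairs whose birth and death years both match within `tolerance`."""
    pairs = []
    for left, right in combinations(records, 2):
        if years_close(left["born"], right["born"], tolerance) and years_close(
            left["died"], right["died"], tolerance
        ):
            pairs.append((left["id"], right["id"]))
    return pairs


# Hypothetical records mirroring the John Smith example.
records = [
    {"id": "existing", "born": 1819, "died": 1891},
    {"id": "new", "born": 1820, "died": 1890},
    {"id": "other", "born": 1850, "died": 1910},
]
print(candidate_pairs(records))
```

A pair returned here is only a candidate. Many unrelated people share birth and death years, so labels, identifiers or other statements still need to be compared before merging. The pairwise comparison is also quadratic in the number of records, which is acceptable for a small batch but not for a whole dataset.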

Outcomes

  • A. Same concept: the items are merged, or the new information is added to the existing item.
  • B. Different concepts: a second item is created. The property different from (P1889) can be used to indicate that the two items are distinct.
  • C. It is not clear whether they are separate: TBD (or search for more information to establish A or B).

Mitigation strategies

  • normalize information in existing items
  • normalize information on new item creation
  • enrich newly created items
  • check for duplicates after item creation
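As one possible reading of "normalize information", labels can be reduced to a comparable form before checking for an existing item. The sketch below is an assumption about what such normalization might look like, not an established Wikidata procedure: the function names `normalize_label` and `find_existing`, the placeholder item IDs and the sample labels are hypothetical.

```python
import unicodedata


def normalize_label(label):
    """Casefold, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", label)
    # Drop combining marks so accented and unaccented spellings compare equal.
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace anything that is not a letter or digit with a space.
    cleaned = "".join(ch if ch.isalnum() else " " for ch in without_marks)
    return " ".join(cleaned.casefold().split())


def find_existing(new_label, existing_labels):
    """Return IDs of existing items whose normalized label equals the new one."""
    key = normalize_label(new_label)
    return [
        item_id
        for item_id, label in existing_labels.items()
        if normalize_label(label) == key
    ]


# Hypothetical existing items, keyed by placeholder IDs.
existing = {"item-1": "John  Smith", "item-2": "Jane Doe"}
print(find_existing("john smith.", existing))
```

Exact matching on normalized labels only catches differences in case, accents, punctuation and spacing. It would not connect "Ted Mosby" with "Galactic President Theodore Evelyn Mosby"; cases like that depend on aliases or on cleaning up the label of the existing item.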

Tools

See also