Wikidata:ORES/List of features

Note that these descriptions are written for people who are not familiar with Wikidata's terminology.

Diff

General metrics

  • Number of added/removed/changed site links (links to Wikipedia)
  • Number of added/removed/changed labels
  • Number of added/removed/changed descriptions
  • Number of added/removed/changed statements
  • Number of added/removed aliases
  • Number of added/removed badges (a badge marks an article on a particular wiki as featured)
  • Number of added/removed qualifiers
  • Number of added/removed references
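
The added/removed/changed counts above can be sketched for any field that maps keys to values, such as labels keyed by language code. This is a minimal illustration, not the ORES implementation: the helper name `count_changes` and the plain-dict representation of labels are assumptions made for the example.

```python
def count_changes(old, new):
    """Return (added, removed, changed) counts between two dicts."""
    # Keys present only in the new revision were added.
    added = sum(1 for key in new if key not in old)
    # Keys present only in the old revision were removed.
    removed = sum(1 for key in old if key not in new)
    # Keys present in both revisions with different values were changed.
    changed = sum(1 for key in new if key in old and old[key] != new[key])
    return added, removed, changed


# Hypothetical labels of one item before and after an edit.
old_labels = {"en": "Douglas Adams", "de": "Douglas Adams", "fr": "Douglas Adams"}
new_labels = {"en": "Douglas N. Adams", "de": "Douglas Adams", "es": "Douglas Adams"}

print(count_changes(old_labels, new_labels))  # (1, 1, 1)
```

The same helper would apply to descriptions or site links if they are represented the same way.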

Specific to certain types of vandalism

  • Proportion of Q-ids added. Adding the Q-ids of items to aliases, labels, etc. is a common type of vandalism.
  • Number of changed identifiers (like IMDb id, etc.). Identifiers should rarely change, and changing them is a common type of vandalism.
  • Is the English label changed. Changing the English label is a common type of vandalism.
  • Proportion of language names added (like adding "English"). Because language names appear in the GUI, adding them is a common type of vandalism.
  • Proportion of external links added: intended to catch spam.
  • Is gender changed: Changing gender is a common type of vandalism.
  • Is country of citizenship changed: Another type of vandalism
  • Is member of sports team changed: Changing statements about the teams a sportsperson has played for is a common type of vandalism.
  • Is date of birth changed
  • Is image changed
  • Is image of signature changed
  • Is category of this item at Wikimedia Commons changed
  • Is official website changed
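
The "proportion" features above can be illustrated as follows. The list does not say how the proportion is computed, so this sketch assumes it is the share of whitespace-separated tokens in the added text that match a pattern. The function name `proportion_matching` and both regular expressions are hypothetical.

```python
import re

# Assumed patterns: a Q-id is "Q" followed by digits; a link starts with http(s)://.
QID_RE = re.compile(r"^Q\d+$", re.IGNORECASE)
URL_RE = re.compile(r"^https?://\S+$", re.IGNORECASE)


def proportion_matching(added_values, pattern):
    """Share of tokens in the added values that match the given pattern."""
    tokens = [tok for value in added_values for tok in value.split()]
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if pattern.match(tok))
    return hits / len(tokens)


# Hypothetical text added by one edit (labels, aliases, descriptions).
added = ["Q42", "famous writer", "http://example.com/buy"]

print(proportion_matching(added, QID_RE))  # 0.25
print(proportion_matching(added, URL_RE))  # 0.25
```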

Specific metrics to exclude certain types of edits that are likely to be mistaken for vandalism

  • Is it a client edit: When a user moves or deletes a page on Wikipedia (a client of Wikidata), an edit is made on Wikidata to update the central repository. Any vandalism in such edits originates in the client itself, so this feature lets them be excluded automatically.
  • Is it a merge: Merges change an item drastically, so they tend to be flagged as vandalism. However, merges are mostly correct, and incorrect ones are still not vandalism, because merging is disabled for new users.
  • Is it a revert, rollback, or restore: These edits try to undo vandalism and are correct most of the time.
  • Is it creating a new item: Creating a new item tends to receive a high vandalism probability because it adds a lot of content.
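
One plausible way to derive these flags is to match the start of the edit summary. The sketch below assumes that approach. The summary prefixes are assumptions about how Wikidata and MediaWiki word their automatic summaries and should be checked against real revisions, and the names `EXCLUSION_PATTERNS` and `exclusion_flags` are hypothetical.

```python
import re

# Assumed edit-summary prefixes; verify against actual revision comments.
EXCLUSION_PATTERNS = {
    "client_edit": re.compile(r"^/\* clientsitelink-(update|remove):"),
    "merge": re.compile(r"^/\* wbmergeitems-(from|to):"),
    "revert": re.compile(r"^(Reverted edits by|Undid revision|Restored revision) "),
    "item_creation": re.compile(r"^/\* wbeditentity-create:"),
}


def exclusion_flags(comment):
    """Map each exclusion feature name to True or False for one edit summary."""
    return {
        name: bool(pattern.match(comment))
        for name, pattern in EXCLUSION_PATTERNS.items()
    }


flags = exclusion_flags("/* wbmergeitems-from:0||Q123 */")
print(flags["merge"], flags["revert"])  # True False
```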

Revision

General metrics

All of these features are scaled using log(feature + 1).

  • Number of statements
  • Number of aliases
  • Number of references
  • Number of labels
  • Number of qualifiers
  • Number of badges
  • Number of site links
  • Number of descriptions
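
The log(feature + 1) scaling applied to these counts can be sketched as below. The source does not state the base of the logarithm, so the natural logarithm is an assumption here. The function name `scale` and the example counts are hypothetical.

```python
import math


def scale(count):
    """Scale a non-negative count as log(count + 1), natural log assumed."""
    return math.log(count + 1)


# Hypothetical raw counts for one revision.
counts = {"statements": 0, "aliases": 9, "site links": 99}
scaled = {name: scale(value) for name, value in counts.items()}

for name, value in scaled.items():
    print(f"{name}: {value:.3f}")
```

Adding 1 before taking the logarithm keeps a count of zero at zero, while large counts are compressed so that very large items do not dominate the feature range.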

Specific to certain types of vandalism

  • Is this item about a human
  • Is this item about a living human
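
As an illustration of these two checks: on Wikidata, an item about a human carries the statement "instance of" (P31) with value "human" (Q5), and a date of death is recorded with P570. The sketch below assumes that "living" means a human with no date-of-death statement, which may differ from the actual feature. The simplified claims dict (property id mapped to a list of values) is a hypothetical stand-in for the real item structure.

```python
HUMAN = "Q5"            # Wikidata item for "human"
INSTANCE_OF = "P31"     # Wikidata property "instance of"
DATE_OF_DEATH = "P570"  # Wikidata property "date of death"


def is_human(claims):
    """True if the item has an 'instance of: human' statement."""
    return HUMAN in claims.get(INSTANCE_OF, [])


def is_living_human(claims):
    """Assumption: a human with no date-of-death statement counts as living."""
    return is_human(claims) and not claims.get(DATE_OF_DEATH)


# Hypothetical simplified items.
deceased_person = {"P31": ["Q5"], "P570": ["2001-05-11"]}
living_person = {"P31": ["Q5"]}

print(is_human(deceased_person), is_living_human(deceased_person))  # True False
print(is_human(living_person), is_living_human(living_person))      # True True
```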

Editor

  • Age of editor in seconds, scaled using log(age + 1)
  • Is the user a bot
  • Is the user anonymous
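
The editor-age feature can be sketched as follows. The sketch assumes that age is measured from account registration to the time of the edit, and that the logarithm is natural. Neither is stated in the list. The function name `scaled_editor_age` and the timestamps are hypothetical.

```python
import math
from datetime import datetime, timezone


def scaled_editor_age(registered, edited):
    """Return log(age + 1), where age is seconds from registration to edit."""
    # Clamp at zero in case the timestamps are out of order.
    age_seconds = max((edited - registered).total_seconds(), 0)
    return math.log(age_seconds + 1)


# Hypothetical account registered exactly one day before the edit.
registered = datetime(2015, 1, 1, tzinfo=timezone.utc)
edited = datetime(2015, 1, 2, tzinfo=timezone.utc)

print(round(scaled_editor_age(registered, edited), 3))
```

An anonymous editor has no registration date. How the real feature handles that case is not described here, so the sketch leaves it out.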