User:Bovlb/wd-deleted

Wikidata Deleted Search

Usage

The tool can be accessed via https://wd-deleted.toolforge.org/ or by using the user script User:Bovlb/deleted-search.js. OAuth login is required, and it can only be used by Wikidata admins and rollbackers. For policy reasons, rollbackers have restrictions on seeing content from deleted items.

Fields:

  • Query: Enter text or QID to search for.
    • If you enter the QID (e.g. Q114239609) of a non-deleted item, it will search for similar deleted items.
    • For other text, it searches all labels, descriptions, and aliases.
  • Advanced: This checkbox disables the escaping of the query string, allowing you to use Solr queries. Some useful options are:
    • Fuzzy search: TEXT~
    • Restrict to English labels: label_en:smith
    • Phrase search: "john smith"
    • Exact match: label_en_str:"John Templeton Smith"
    • Field search: claim_str_P345:"nm123456"
    See the expanded query generated by QID search for some fields you might use.
    Be careful using this option if your query contains punctuation.
  • Rows: Limits the number of rows returned. This is primarily useful to exclude irrelevant results from the canned messages.
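The advanced options listed above can be combined into a single query string. The following is a minimal sketch of how such a string might be assembled, not the tool's own code. The helper names (quote_phrase, field_query, claim_field) are hypothetical, the claim_str_<property> pattern is generalised from the single example above, and joining clauses with AND assumes the tool passes advanced queries to Solr's standard query parser.

```python
def quote_phrase(text: str) -> str:
    """Wrap text in double quotes for a phrase or exact-match query.

    Inside a quoted phrase, backslashes and double quotes need escaping.
    """
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def field_query(field: str, value: str) -> str:
    """Build a FIELD:"value" clause for the given field name."""
    return f"{field}:{quote_phrase(value)}"


def claim_field(property_id: str) -> str:
    """Return the field name for a string-valued claim.

    Assumption: the pattern is generalised from the example claim_str_P345.
    """
    return f"claim_str_{property_id}"


# Combine an exact label match with an identifier match.
clauses = [
    field_query("label_en_str", "John Templeton Smith"),
    field_query(claim_field("P345"), "nm123456"),
]
query = " AND ".join(clauses)
print(query)
# label_en_str:"John Templeton Smith" AND claim_str_P345:"nm123456"
```

The resulting string would be pasted into the Query field with the Advanced checkbox ticked. Quoting every value keeps punctuation inside it from being read as query syntax.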

Results

Query results show:

  • Title:
    • Item Q-id, linked to the undelete page for review of deleted versions
    • Match score, as returned by Solr. Not well-calibrated.
    • Highlight drilldown (admins only): Click on the speech bubble to get an expanded view of the matches line, together with a dump of all fields.
    • Labels, across all languages. Only with full access.
    • Descriptions, across all languages. Only with full access.
  • Created: user (linked to contributions), date-time
  • Deleted: user, reason, date-time
  • Matches: Fields that had "hits" with those hits highlighted (partial display for rollbackers)

Below the query results are various formatted texts suitable for copying this information into Wikidata venues. These are based on all results shown, so you might want to use "Rows" to trim them down to only the appropriate entries. Remember that you are responsible for any information exposure.

Known issues

  • Back end index is manually updated. Need to work out the permission issues for automation.
  • Index does not take account of subsequent undeletion or oversighting.
  • Queries containing periods are not handled well. Need to enhance escaping routine.
  • Tokenisation can produce unexpected results, e.g. for words with numbers in them.
  • Schema is undocumented.
  • Only the last version of a page is used in the index.
  • Contributing users are not tracked. Many spam pages are started by an IP and continued by an account.
  • Index only goes back about four months. As it grows, I'm probably going to have to start trimming it, as it's running on one modest server.
  • When match highlighting is applied to a field whose text direction differs from the browser default (e.g. Arabic text for an English user), the parts of the highlighted match do not always appear in the correct order.
    • Maybe this behaves correctly for some users.
  • QID search only works for existing items, not for deleted items. Should be able to make this work.
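For the escaping issue noted above, a conventional approach is to backslash-escape each character that has special meaning to the Lucene/Solr standard query parser. The sketch below illustrates that approach only; it is not the tool's actual escaping routine, and the names SPECIAL_CHARS and escape_query_text are hypothetical. Note that the period is not one of the standard parser's special characters, so it is an open question whether better escaping alone would fix the handling of periods, or whether tokenisation is also involved.

```python
# Characters treated as syntax by the Lucene/Solr standard query parser.
# '&' and '|' are escaped individually, which also covers '&&' and '||'.
SPECIAL_CHARS = set('\\+-!(){}[]^"~*?:/&|')


def escape_query_text(text: str) -> str:
    """Backslash-escape query syntax characters so text is searched literally."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)


print(escape_query_text("AC/DC (band)"))    # AC\/DC \(band\)
print(escape_query_text("label_en:smith"))  # label_en\:smith
print(escape_query_text("St. John"))        # St. John (period is left as-is)
```

Whitespace is deliberately left unescaped here so that a multi-word query is still split into separate terms.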

Future work

  • Search by QID: Search for deleted pages similar to a current page. Associated user script so you can click search from a page.
    • Search by QID is now available.
    • A simple user script is now available.
  • Spam explorer: There should be a feature to explore networks of creating users.
  • Similarity score: A more reliable score to say whether a current page is a recreation. Could also use other information like identifiers and URLs. Associated user script to annotate (say) Recent Changes with a badge.
  • Show which users are currently blocked.
    • Blocked users are indicated with strikeout, and details are given in a tooltip.
    • Block reason is now split out for block evasion and abuse of multiple accounts.
  • References are not indexed. Would be good to be able to search by a specific URL.

Privacy concerns

All searches made with this tool are logged along with your username. These logs are available to tool maintainers, and are used for debugging purposes. There is no intention to publish the logs in any form.