Wikidata:Tools/Author Disambiguator

Author Disambiguator is a tool for editing the authors of works recorded in Wikidata. The tool was developed as part of the WikiCite initiative, and is partially coordinated with the Scholia project, which provides visual representations of the scholarly literature based on what can be found in Wikidata. As of October 2020, Scholia statistics showed that Wikidata contained data for over 36 million scholarly articles, in which authors were represented as simple strings (the author name string (P2093) property) in 133 million instances, and as links to author items (the author (P50) property) about 19 million times. Author item relations allow for much richer analysis and tracing of relationships between researchers and their works, institutions, etc. The aim of this tool is to assist in converting those strings into links to author items as efficiently and easily as possible.

Main features

Find and group works with (close to) a given name string

 
Author disambiguator author name entry form

The main field in the author name entry form is the author name. This name is used to find both works with this name string and potential author items in Wikidata that could replace those strings. The name should be entered in natural order (for example, Given name(s) Family name(s) for western authors). You can also copy and paste the name exactly from the string value in a sample work. Behind the scenes the name is parsed into components (separated by spaces or dashes), which are used to generate other forms of the name that may have been used in works. The options selected determine how the name is used for searching:

  • Fuzzy match: this does the most aggressive form of automated name parsing, looking for first and middle names as initials, upper-case versions of the names, "Last F" format, etc. In most cases, except with very common family names, this is probably the most useful option if you are trying to find the largest possible selection of works to match. (Note that the variants are derived only from the name as entered. For example, searching "Jim Smith" will return results for "J Smith" and "Smith J", but not "Jimmy Smith".)
  • Wikibase search: by default the service only uses exact string matches to the generated variations on the name. With this option the search is also extended to effectively use the Wikidata search box for the name (in particular it will ignore all accents and case variations). The search term is treated as quoted, so "James Baker" will match "Peter James Baker" and "James Baker-Jarvis", but not "James F. Baker" or "James Kenneth Baker".
 
Example of the "Specify name strings" box with variations on the provided name.
  • Specify name strings: check this and immediately hit the "Look for author" button, and a text box containing the possible name variants appears, looking something like the example to the right. By default this box shows the name variants generated automatically from the supplied name - you may notice in this case there are versions with and without accents, and with initials for middle names or no middle names at all, as well as the full supplied name. The text box then allows you to remove names from the list or to add variants that were not auto-generated. Enter one name string per line. These then allow for more precise specification of the name strings used for searching works and author items. In the example given here, even with fuzzy matching the auto-generated names did not include the common variation "J. Benlloch", so adding that variation was useful.
  • Additional SPARQL filters: this is mostly useful if you are seeing far too many matching works (more than the limit of 500, for example) or if you otherwise want to filter the works you are matching on. The filters are applied to the associated works, so any property of a work can be used. The example suggestion uses main subject (P921), but you may also be interested in filtering on author name string (P2093) (a co-author name string), author (P50) (a particular identified co-author), published in (P1433), etc.
  • Filter potential authors as well?: this applies the SPARQL filter to any works the person is an author (P50) for, so only authors with matching works will be listed.
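The name-variant generation described in the options above can be illustrated with a small Python sketch. It is not the tool's actual code: the functions `strip_accents` and `name_variants` are hypothetical, the sketch splits only on spaces (the tool also splits on dashes), and the set of variants produced is an assumption based on the forms mentioned above (initials, "Last F" format, upper case, accent-free versions).

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with combining accent marks removed (e.g. 'é' -> 'e')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def name_variants(full_name: str) -> list[str]:
    """Generate plausible name strings for a name given in natural order.

    Hypothetical helper: the real tool applies more extensive rules.
    """
    parts = full_name.split()  # assumption: components separated by spaces only
    if len(parts) < 2:
        return [full_name]
    given, family = parts[:-1], parts[-1]
    initials = [g[0] for g in given]

    candidates = [
        full_name,                          # name exactly as supplied
        f"{given[0]} {family}",             # middle names dropped
        f"{' '.join(initials)} {family}",   # all given names as initials
        f"{initials[0]} {family}",          # first initial only, "F Last"
        f"{family} {''.join(initials)}",    # "Last FM" format
        f"{family} {initials[0]}",          # "Last F" format
        full_name.upper(),                  # upper-case version
    ]
    # Add accent-free copies, then drop duplicates while keeping order.
    candidates += [strip_accents(c) for c in candidates]
    return list(dict.fromkeys(candidates))
```

With this sketch, `name_variants("Jim Smith")` includes "J Smith" and "Smith J" but not "Jimmy Smith", which mirrors the behaviour noted for fuzzy matching: variants are derived mechanically from the entered name, not from knowledge of nicknames.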

The search for author items also looks at the object named as (P1932) value often used as a qualifier on author (P50) statements, as well as the labels and aliases on the author items themselves. If you are surprised by an author item shown in the resulting list, it may be because of an unexpected (or erroneous) alias or object named as (P1932) value somewhere.

Once works have been found to match the author name string search, a clustering algorithm is used to display them in groups. The groupings are based on several criteria, including the names or identifiers for co-authors, any listed topics, or journal of publication. An alternative clustering algorithm based strictly on the name string format of the given author and the preceding (if any) and succeeding (if any) author names or name strings is also available via a link at the top of the groups. The groups are roughly ordered by size, with the larger groups first, and within groups the works are ordered by (descending) publication date, if any. Works with no publication date found in Wikidata are listed at the end of each group. All works that could not be clustered with any other are placed in a group called "Misc" at the bottom, which is otherwise similarly ordered. The clustering is intended to group works by different authors into different groups, so it should usually be reasonable to select all the works in a given group (except for the "Misc" one) to match to the associated author item.
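The display order described above (larger groups first, newest works first within a group, undated works last, "Misc" at the bottom) can be sketched in Python as follows. The data layout and the function names `order_works` and `order_groups` are assumptions made for illustration; the clustering step itself is not shown.

```python
from datetime import date
from typing import Optional

# A work is represented here as (title, publication date or None).
Work = tuple[str, Optional[date]]


def order_works(works: list[Work]) -> list[Work]:
    """Newest first; works without a publication date go to the end."""
    dated = sorted(
        (w for w in works if w[1] is not None),
        key=lambda w: w[1],
        reverse=True,
    )
    undated = [w for w in works if w[1] is None]
    return dated + undated


def order_groups(groups: dict[str, list[Work]]) -> list[tuple[str, list[Work]]]:
    """Larger groups first, with the catch-all "Misc" group always last."""
    named = [(name, works) for name, works in groups.items() if name != "Misc"]
    named.sort(key=lambda item: len(item[1]), reverse=True)
    ordered = [(name, order_works(works)) for name, works in named]
    if "Misc" in groups:
        ordered.append(("Misc", order_works(groups["Misc"])))
    return ordered
```

The tool describes its group ordering as only "roughly" by size, so the strict size sort here is a simplification.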

 
Start of the "Potential Publications" list, with the first grouping of works.

For each work the title is displayed, linked to the work page within the tool. The author list follows, with already matched author items shown in green (linked to their author page within the tool) and unmatched authors in blue (linked to the associated name search page). The author name that matches the search criteria is shown in black, with a checkbox to select if that author name string should be replaced with the selected author item. Other links in the table go either to the associated Wikidata item or to the external website (for DOI or other identifiers). Publications and topics (and for author items, institutions) also link out to the Scholia "missing" page associated with them, which provides a list of associated but still-unmatched author name strings.

If the clustering criteria (co-authors, publications, topics) match one of the author items found, the right-most column of the table shows the matching author (or authors, if more than one matches), each linked to its author page within this tool.

 
Matching work for an author, showing the author name string amidst the author list, expected match on the right side.

Note that if there are a large number of authors on a work, the author list is abbreviated to only show the first ten, and then up to five surrounding the matched author name string. If more than one author name string matches, all matching authors will be shown with their associated checkboxes, so the correct one can be selected.
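As a rough illustration of this abbreviation rule, the following Python sketch computes which author positions would be displayed. The function `visible_positions` is hypothetical, and it assumes that "up to five surrounding" means a window of five authors centred on the matched one; the tool may define the window differently.

```python
def visible_positions(total: int, matched: int,
                      head: int = 10, window: int = 5) -> list[int]:
    """Return the 0-based author positions to display.

    Assumption: the first `head` authors are always shown, plus a window
    of `window` authors centred on the matched position.
    """
    shown = set(range(min(head, total)))
    half = window // 2
    start = max(0, matched - half)
    end = min(total, matched + half + 1)
    shown.update(range(start, end))
    return sorted(shown)
```

For a work with 100 authors and a match at position 50, this yields positions 0 to 9 followed by 48 to 52; short author lists are returned in full.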

Below the groups of works is the list of potentially matching authors. Only one author may be selected: either one of those listed, or an author not in the list, entered via the "Other Q number for this author" option. There is also a form for creating a new author item within Wikidata if necessary.

 
Potential authors listing, with button to start linking process

Clicking the "Link selected works to author" button starts a batch process that, for each selected work, replaces the selected author name string with an author (P50) statement that carries the same qualifiers and references, plus an additional object named as (P1932) qualifier holding the original name string value.
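The effect of a single replacement can be sketched in Python. The dictionary layout below is a simplified, hypothetical stand-in for Wikidata's statement format, and `to_author_statement` is an illustrative name, not a function of the tool; the Q and P values in the usage example other than P50, P2093, P1545 and P1932 are placeholders.

```python
from copy import deepcopy


def to_author_statement(name_statement: dict, author_qid: str) -> dict:
    """Build the author (P50) statement that replaces a P2093 statement.

    Qualifiers and references are carried over unchanged, and the original
    name string is preserved in an "object named as" (P1932) qualifier.
    """
    qualifiers = deepcopy(name_statement.get("qualifiers", {}))
    qualifiers["P1932"] = name_statement["value"]  # keep original name string
    return {
        "property": "P50",
        "value": author_qid,
        "qualifiers": qualifiers,
        "references": deepcopy(name_statement.get("references", [])),
    }
```

For example, a name string statement `{"property": "P2093", "value": "J Smith", "qualifiers": {"P1545": "3"}}` linked to a placeholder item "Q42" becomes a P50 statement with qualifiers `{"P1545": "3", "P1932": "J Smith"}`, so the series ordinal and the name as printed on the work are both retained.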

Find works with a given author

 
Entry form for author Wikidata Qid

This page (found from the "Authors" link in the top right navigation bar, or via author item links on other pages in the tool) shows all works having a given author (P50) value. Similar to the name search page, an additional SPARQL filter can be used to limit the resulting works list based on topic, publication venue, coauthors, etc. The resulting list of works is again ordered chronologically in reverse by publication date, with the same links shown as works listed in the name search page. If some works have been assigned to the wrong author item they can be moved to the correct one via the form at the bottom of the works list, where the Wikidata ID of the correct author item may be entered.
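To show how an additional SPARQL filter narrows the works list, here is a minimal Python sketch that assembles a query string. It is not the query the tool actually runs: the function `works_by_author_query` and the variable name `?work` are assumptions, and the item IDs in the usage note are placeholders. The `wd:` and `wdt:` prefixes are the standard ones predefined by the Wikidata Query Service.

```python
def works_by_author_query(author_qid: str, extra_filter: str = "") -> str:
    """Return a SPARQL query for works whose author (P50) is author_qid.

    `extra_filter` is an optional graph pattern on ?work, appended as-is.
    """
    lines = [
        "SELECT DISTINCT ?work WHERE {",
        f"  ?work wdt:P50 wd:{author_qid} .",
    ]
    if extra_filter.strip():
        lines.append(f"  {extra_filter.strip()}")
    lines.append("}")
    return "\n".join(lines)
```

Calling `works_by_author_query("Q42", "?work wdt:P1433 wd:Q1 .")` would restrict the results to works by the placeholder author that were published in (P1433) the placeholder venue. A filter on main subject (P921) or on a co-author (P50) follows the same pattern.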

The "Find duplicates to merge" checkbox searches for works linked to this author that have more than one author or author name string statement associated with the same series ordinal (P1545) value. This is often due to duplication, or to neglecting to remove the author name string (P2093) value when an author (P50) value was added. If the names match (based on name parsing criteria similar to those used for main author name matching), a checkbox is shown next to the work, allowing those values to be merged (i.e. the author name string (P2093) and duplicate author (P50) values removed, qualifiers and references merged, etc.). Cases where the names do not match show a 'mismatch' indicator, and should probably be examined individually to address the problem.
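The duplicate check amounts to grouping a work's author statements by series ordinal and keeping the ordinals that occur more than once. A minimal Python sketch, assuming each statement is a dictionary with a hypothetical "ordinal" key (the function name `duplicate_ordinals` is likewise illustrative):

```python
from collections import defaultdict


def duplicate_ordinals(statements: list[dict]) -> dict[str, list[dict]]:
    """Group author statements by series ordinal and return only the
    ordinals that carry more than one statement."""
    by_ordinal = defaultdict(list)
    for statement in statements:
        ordinal = statement.get("ordinal")
        if ordinal is not None:  # statements without an ordinal are skipped
            by_ordinal[ordinal].append(statement)
    return {o: sts for o, sts in by_ordinal.items() if len(sts) > 1}
```

Each returned group is a merge candidate; whether its names actually match still has to be decided by the name comparison described above.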

View and edit authors on a given work

 
Form to enter the Wikidata Qid of a work item

This page is reached via the "Works" link in the top-right navigation bar, or from a link on one of the other pages. Depending on the checkboxes selected, the page has several different modes for viewing or editing the author list for a work. In all modes the main table shows the authors, listed sequentially based on their series ordinal (P1545) value. Authors with no series ordinal (P1545) are listed at the bottom. As on the name search page, author entries which are just strings (author name string (P2093)) are shown in blue, linked to the associated name search page, and author items (author (P50)) are shown in green, linked to the associated author page in this tool.
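The ordering of the author table can be sketched in Python as below. This assumes each author entry is a dictionary with a hypothetical "ordinal" key holding a numeric string; the function `sort_by_ordinal` is illustrative, not part of the tool. The point of the example is that ordinals need to be compared as numbers, since as plain strings "10" would sort before "2".

```python
def sort_by_ordinal(authors: list[dict]) -> list[dict]:
    """Sort authors by numeric series ordinal; unnumbered authors go last.

    Assumption: every ordinal present is a string of digits.
    """
    def sort_key(author: dict) -> tuple[int, int]:
        ordinal = author.get("ordinal")
        if ordinal is None:
            return (1, 0)          # no ordinal: placed after all numbered authors
        return (0, int(ordinal))   # compare numerically, not as text

    # sorted() is stable, so unnumbered authors keep their original order.
    return sorted(authors, key=sort_key)
```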

In default mode (no checkboxes selected in the top form), the work item page allows removal of un-numbered authors, or merging multiple author/author name string values associated with the same number. If none of these changes are possible, no action button is displayed at the bottom of the page.

In "renumber" mode (check "Renumber authors?") the series ordinal values for any of the author names or items can be modified. This works only up to a maximum of 5000 authors on a given work. Note that in this and in other modes for a work item, when the edit is made it is done in a single edit to the Wikidata item - this reduces the load on associated updates on the query service. Authors with no change in series ordinal value will not be affected by such an edit.

In "match" mode (check "Suggest matches?") a list of potential matching author items is used to try to find items to replace as many as possible of the remaining author name string values. By default this list comes from all items that are coauthors (on other works) of author items already identified on this work. However, other lists of authors may be used for matching by selecting a different choice from the "Author List" drop-down - see the "Manage lists of author items for use in matching" section below. Selecting the 'Use "stated as" names' checkbox uses the full matching algorithm with object named as (P1932) values from other works by that author, making it more likely that an author item will match one of the author name strings on the work. For authors with many works this query takes additional time, so the option is best left unchecked when it is not needed.

Manage lists of author items for use in matching

This feature is still in development. The page is reached through the "Lists" link in the top-right navigation bar. It allows creation and management of lists of Wikidata author items - a large collaboration, other coauthors, or just a limited topical selection list. The lists can be selected on the work-item page for the purpose of matching authors.

Ordering in these author lists doesn't currently matter; authors are displayed in the order they were added. Authors can be added individually or as all identified authors on a given work or works. Author lists can be compared with one another, and also with the authors on a particular work item, to identify common and differing elements.

Monitoring, stopping, or restarting batches of edits

Edits to work items made with the Author Disambiguator tool are all done in a background batch mode. Each batch consists of one or more edits associated with your activities on a given author or work item. All your batches can be found through the "Batches" link in the menu bar. Batches are listed in reverse chronological order (based on last modified date, not creation date). Each batch is also associated with an "edit group", which can be reviewed with the Edit Groups tool.

For each user (identified through OAuth) only one batch is allowed to run at a time, and within that batch only one edit can be done at a time; that edit is shown in the "Running" state. Other edits that are waiting show as "Ready". A successfully completed edit shows as "Done". If there was any problem completing an edit, it shows an "Error" state, with an associated message visible on the page for that particular batch. The message should indicate what the problem was; for example, "duplicate ordinal '129'" indicates that two or more distinct author items were matched to the author name at series ordinal 129. If the error message indicates a temporary problem (for example a "failed to save" message from the Wikidata API), the "Reset errors" link can be used on either the individual batch page or the batch listing page, and the batch can then be restarted to retry that particular edit. Batches can also be stopped and restarted from the listing page.

Note that there may be times when the Wikidata servers are busy and a particular edit may appear to be in the "Running" state for a long time (an hour or more). Check the dispatch lag/maxlag statistics on Grafana to verify that this is what is happening. If that doesn't appear to be the problem, try stopping and restarting the batch.

Deleting completed (or erroneous) batches is recommended; this has no effect on the "Edit Groups" functionality or on any of the completed edits, and leaves the database a little cleaner.

Source code, change requests, etc.

The Author Disambiguator tool runs on Toolforge, with the code managed in a GitHub repository. Please use the GitHub issues page to suggest changes or make any other requests.