User:ArthurPSmith/10th Birthday

ATLAS Author Disambiguation - Happy 10th Birthday Wikidata!

One of the big things I've worked on in recent months is the author lists for papers from the ATLAS experiment (Q299002). With Georges Aad (Q57305547) as regular first author (or sometimes second author), the ATLAS collaboration at CERN (Q42944) has a relatively consistent list of about 3000 authors on close to 1000 published papers. According to a query run at the time of writing, Georges Aad is an author on 877 Wikidata items, which have 2,521,876 (over two and a half million) author statements between them. About 1.7 million of those are statements I added in the last few months with the "author list" feature in Wikidata:Tools/Author Disambiguator. Other Wikidata editors have worked on ATLAS (special thanks to Simon Villeneuve), but mostly by handling authors one at a time. After experimenting on somewhat smaller collaborations, I decided around the middle of 2022 that I was ready to tackle the much longer author lists in ATLAS.

The challenges with these author lists are many. Given names are truncated to just a single initial, making name collisions (the same string value for different people) quite common; there are even several cases where ATLAS simultaneously had two authors with the same full name - at various times there were two 'Brian Martin's, two 'Chao Wang's, or two 'Yi Yang's. While the author lists are relatively stable, they do change on the order of 10% every year - with about 3000 authors, that means about 300 new names to add and 300 to remove from the lists each year (I decided to go with annual lists after attempting a 3-4 year list at first). Sometimes the same person has changed names, or goes by different names; a related problem is Hispanic names with multiple surnames, where sometimes all the surnames are present and sometimes only some of them. There were also numerous cases (though not a huge percentage) of previous erroneous edits that had to be corrected. Even with seemingly very rare names, there can be two different people with that name. Sometimes the same paper had been added twice and the two items then merged, but the two author lists had an off-by-one error somewhere, so a massive renumbering was needed. For a number of items the author list was erroneously truncated at 2048 (I guess the import script had assumed the author count would never be higher than that!). All these issues can now be fixed somewhat routinely with the Author Disambiguator.
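To illustrate why single-initial name strings collide so easily, here is a minimal Python sketch. It is not part of the Author Disambiguator; the helper names `name_string` and `find_collisions` are hypothetical, the example names are made up, and the sketch assumes a simple "Given Family" name layout with the first space separating the two parts.

```python
from collections import defaultdict

def name_string(full_name):
    """Reduce 'Given Family' to 'G. Family', as in a truncated author list.

    Assumption: the given name is everything before the first space.
    """
    given, _, family = full_name.strip().partition(" ")
    return f"{given[0]}. {family}"

def find_collisions(full_names):
    """Group full names by their truncated name string and keep only
    the strings shared by more than one distinct full name."""
    groups = defaultdict(set)
    for name in full_names:
        groups[name_string(name)].add(name)
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}

# Made-up names, purely for illustration.
authors = ["Alice Example", "Adam Example", "Bob Sample"]
collisions = find_collisions(authors)
```

Here `collisions` maps "A. Example" to both made-up authors, which is the kind of case that needs affiliation or other metadata to resolve. Two people with an identical full name would not show up in this check at all, which is why those cases are harder still.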

Thankfully there is significant metadata available on these authors since ATLAS began publishing (around 2010), so they can generally be disambiguated by affiliation. I could usually find full author names and affiliations from INSPIRE-HEP (Q5972440), from some directories published by the ATLAS experiment itself, from various university websites, and from similar sources, and the ORCID iD (Q51044) database was very useful for confirming matches, especially for recent papers. Adding the 2000+ authors for the first paper I worked on took a long time, but once I had that list I could use the Author Disambiguator match function to try to match author names in the next paper, which (if the two papers were close in publication date) would usually get around 90% of the authors on the first try. Each auto-match needed a manual review before hitting the update button: for name strings that had matched to multiple author items (there's now a simple filter to show these), and for author items that had matched to multiple name strings (search for the "also" string on the page). I did plenty of spot-checking of affiliations between the Wikidata item and the published article, and at least once per year of published articles I systematically checked the author entries for every author with a repeated family name (all the Martins, Schmitts, Wangs, Webers, etc.). This surfaced a few more cases where the same name string had switched from meaning one person to another during a single year; sometimes a few papers in the middle of the year had both authors with the same name string, which was easier to spot with the duplication checks. In a few rare cases there was a gap of a few months where neither author appeared in the author lists, a changeover that could have been quite confusing. There are likely still issues like this in the dataset due to delays between article submission and publication, and particularly for errata, which are sometimes published several years later.
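The match-then-review step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Author Disambiguator's actual code: the function `match_authors`, the data layout, and the placeholder item IDs (`Q_X1` and so on) are all hypothetical.

```python
from collections import defaultdict

def match_authors(name_strings, known_authors):
    """Match a paper's author name strings against an existing author list.

    name_strings:  name strings in author order (position 1, 2, ...).
    known_authors: dict of item ID -> set of name strings already seen
                   for that author (hypothetical layout).
    Returns the candidate matches plus the cases needing manual review.
    """
    by_name = defaultdict(list)
    for item_id, names in known_authors.items():
        for name in names:
            by_name[name].append(item_id)

    # Candidate author items for each position in the author list.
    matches = {pos: sorted(by_name.get(name, []))
               for pos, name in enumerate(name_strings, start=1)}

    # Name strings that matched more than one author item.
    ambiguous = {pos for pos, ids in matches.items() if len(ids) > 1}

    # Author items that matched more than one name string.
    positions = defaultdict(list)
    for pos, ids in matches.items():
        for item_id in ids:
            positions[item_id].append(pos)
    repeated = {i: p for i, p in positions.items() if len(p) > 1}

    unmatched = [pos for pos, ids in matches.items() if not ids]
    return matches, ambiguous, repeated, unmatched

# Placeholder data, purely for illustration.
known = {"Q_X1": {"A. Example"}, "Q_X2": {"A. Example"},
         "Q_X3": {"B. Sample", "B. Sample-Other"}}
paper = ["A. Example", "B. Sample", "B. Sample-Other", "C. New"]
matches, ambiguous, repeated, unmatched = match_authors(paper, known)
```

In this toy run, position 1 is flagged because its name string matches two items, item `Q_X3` is flagged because it matched two positions, and position 4 is left unmatched as a new name to add by hand.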

The final result: almost every ATLAS paper now has 100%, or very close to 100%, of its author name string (P2093) claims replaced by author (P50) statements, identifying the authors with Wikidata items. The total (2.5 million author statements) is about 2% of the 134 million remaining author name strings noted in Scholia statistics, and about 10% of the 27 million author-work links so far. So not bad for a few months' work! And finished just in time for the 10th birthday - Happy Birthday Wikidata!