User:Charles Matthews/YourPaintings report

This report covers the matching of the artist dataset of Art UK (Q7257339), as found on its website, to Wikidata items. A complete pass has been made with Magnus Manske's mix'n'match tool, dividing artists into "matches", "not on Wikidata" and "N/A". See the sections below for more on the latter two categories, and for qualifications.

The pass finished 17 July 2015, with 8757 matches found (23.9%) from Wikidata to YourPaintings artist pages. That total includes a number of items created specially. "Not on Wikidata" came out at 35.9%.

The project here really got under way in March 2015, when the data began to be "cleaned up". The details page shows that about 30 editors contributed.

Technical issues and N/A

The YourPaintings dataset results from the amalgamation of over 2500 GLAM institutions' data; it is much more miscellaneous than the typical catalog used in mix'n'match. Its unusual provenance is reflected in the high proportion (just over 40%) of items marked "N/A". This was done mainly to remove artists without a date given, who could therefore not securely be matched to any item. (This flagging need not be permanent, since YP data as posted on the website is not necessarily the last word.)
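As a rough cross-check on the percentages quoted in this report, the size of the dataset and the N/A share can be estimated from the match count. This is only a sketch: it assumes that the three categories are exhaustive and that the quoted percentages are rounded figures on the same base. The variable names are illustrative.

```python
# Figures quoted in this report (pass finished 17 July 2015).
MATCHES = 8757
MATCH_SHARE = 0.239            # "matches"
NOT_ON_WIKIDATA_SHARE = 0.359  # "not on Wikidata"

# Assumption: every artist falls into exactly one of the three categories.
estimated_total = round(MATCHES / MATCH_SHARE)
na_share = 1 - MATCH_SHARE - NOT_ON_WIKIDATA_SHARE

print(f"Estimated artists in dataset: about {estimated_total}")
print(f"Implied N/A share: {na_share:.1%}")
```

Because the percentages are rounded, the estimated total is approximate; the implied N/A share is consistent with the "just over 40%" stated above.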

Basic links

On Wikipedia, matching work started in December 2012 at

https://en.wikipedia.org/wiki/Wikipedia:GLAM/Your_paintings

There are chronological listings at

https://en.wikipedia.org/wiki/Wikipedia:GLAM/Your_paintings#Artists_by_birth_date

and an alphabetical set of pages

https://en.wikipedia.org/wiki/User:Magnus_Manske/Your_Paintings

That work has yet to be reconciled in detail with the work here; but it should be noted that the matching personnel on Wikipedia and here has overlapped very significantly.

Cross-matching

Of the 8757 items matched in Wikidata to the PCF artists, 6863 have links to the English Wikipedia.

For updates, use queries of the form CLAIM[1367] AND CLAIM[373] etc. in Wikidata query. A query of the form CLAIM[1367] AND NOCLAIM[373] returns a list of the items that have a YP identifier but lack the second property.

To find the item for a given YourPaintings identifier such as william-blake (from http://www.bbc.co.uk/arts/yourpaintings/) use the code STRING[1367:william-blake] (returns a single hit in Autolist).
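The query strings above can be assembled programmatically when many properties need checking. The following is a minimal sketch that only builds strings in the syntax quoted in this section; the helper names are hypothetical, and nothing here sends a query to any service.

```python
# Hypothetical helpers that build query strings in the syntax shown above.
def claim(prop: int) -> str:
    return f"CLAIM[{prop}]"

def noclaim(prop: int) -> str:
    return f"NOCLAIM[{prop}]"

def string(prop: int, value: str) -> str:
    return f"STRING[{prop}:{value}]"

def all_of(*terms: str) -> str:
    # Join the individual terms with AND, as in the examples above.
    return " AND ".join(terms)

YP_ARTIST_ID = 1367      # the YP identifier property used in this report
COMMONS_CATEGORY = 373   # Commons category (P373)

# Items with a YP identifier but no Commons category.
missing_commons = all_of(claim(YP_ARTIST_ID), noclaim(COMMONS_CATEGORY))
# Lookup of a single identifier.
blake_lookup = string(YP_ARTIST_ID, "william-blake")

print(missing_commons)
print(blake_lookup)
```

Swapping 373 for another property number from the table below gives the corresponding "missing" list.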

Property                                               Matches July 2015   Matches August 2015
Total                                                  8757                8669
Commons category (P373)                                5183                5180
Commons Creator page (P1472)                           3290                3297
Oxford Dictionary of National Biography ID (P1415)     1974                1968
ULAN ID (P245)                                         7323                7720
RKDartists ID (P650)                                   6994                7018
VIAF ID (P214)                                         7518                8022
GND ID (P227)                                          3535                3810
Library of Congress authority ID (P244)                3221                3656
ISNI (P213)                                            3744                4135
National Portrait Gallery (London) person ID (P1816)   1753                1767

NB (July 2015): Matches with authority control identifiers have not at this point been as thoroughly sought as they might have been. Updates should see further matching. Clearly another caveat is also in order: all matches here are provisional.

Linkage

Type of link            Matches August 2015
To enWP                 6831
To deWP                 2751
To deWP and enWP        2465
To deWP and not enWP    286
To frWP                 3288
To frWP and enWP        2970
To frWP and not enWP    318
To nlWP                 1871
To nlWP and enWP        1778
To nlWP and not enWP    93
To itWP                 2222
To itWP and enWP        2132
To itWP and not enWP    90
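The linkage figures can be checked for internal consistency: for each language, the total should equal the number of items also linked to enWP plus the number not linked to enWP. A short sketch of that check follows, with the numbers copied from the table above; the names are illustrative.

```python
# (total, also on enWP, not on enWP), copied from the August 2015 table.
linkage = {
    "deWP": (2751, 2465, 286),
    "frWP": (3288, 2970, 318),
    "nlWP": (1871, 1778, 93),
    "itWP": (2222, 2132, 90),
}

def is_consistent(total: int, with_en: int, without_en: int) -> bool:
    # The two sub-counts partition the total.
    return total == with_en + without_en

for wiki, (total, with_en, without_en) in linkage.items():
    status = "ok" if is_consistent(total, with_en, without_en) else "MISMATCH"
    print(f"{wiki}: {with_en / total:.1%} also on enWP ({status})")
```

The "not enWP" rows are the ones of most interest for follow-up, since they point to artists with a Wikipedia article in another language but none in English.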

Lists

After exclusion:

i.e. as good as CLAIM[1367] AND LINK[frwiki] AND NOLINK[enwiki]

Procedure

Issues involved in a full mix'n'match pass can be illustrated by an idealised model:

Assume all names are in given name + family name form, e.g. "Jane Doe". Assume also that all items coming up in the site search here have (English) labels and descriptions; and that for painters, for example, "painter" or a synonym is present in the description.

Then the issue is simply of deciding whether the person is the right person to match. The primary tool for disambiguation is to look at dates. For near misses or in case of doubt (e.g. a rather common name) further biographical information should be sought.

Incorrect matches, i.e. false positives, are more of a problem than false negatives, because the incorrect information in Wikidata then may not be picked up quickly. False negatives represent missed opportunities and, if items are systematically created after the pass, will require merges. Overall, though, false negatives are not the end of the world, and matching should be conservative for that reason.
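The conservative use of dates in the idealised model can be sketched as a small decision function. This is an illustration under simplifying assumptions, not the procedure mix'n'match itself applies: dates are reduced to years, a match requires exact agreement on every year known on both sides, and anything doubtful is left for checking by hand. The type and function names are hypothetical.

```python
from typing import List, NamedTuple, Optional

class Person(NamedTuple):
    name: str
    born: Optional[int]  # year of birth, if known
    died: Optional[int]  # year of death, if known

def dates_agree(a: Person, b: Person) -> bool:
    # Compare only the years known on both sides; require at least one.
    pairs = [(a.born, b.born), (a.died, b.died)]
    known = [(x, y) for x, y in pairs if x is not None and y is not None]
    return bool(known) and all(x == y for x, y in known)

def decide(artist: Person, candidates: List[Person]) -> str:
    if artist.born is None and artist.died is None:
        return "N/A"              # no date given: cannot be matched securely
    if not candidates:
        return "not on Wikidata"  # the search returned nothing
    agreeing = [c for c in candidates if dates_agree(artist, c)]
    if len(agreeing) == 1:
        return "match"
    return "check by hand"        # near miss or ambiguity: seek more biography

# Invented example data, for illustration only.
artist = Person("Jane Doe", 1800, 1870)
print(decide(artist, [Person("Jane Doe", 1800, 1870), Person("Jane Doe", 1901, 1980)]))
print(decide(artist, [Person("Jane Doe", 1801, 1870)]))
```

Returning "check by hand" rather than guessing keeps false positives down at the cost of some false negatives, which is the trade-off argued for above.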

There are clearly extensions required to the model.

Names with three parts require "subset search": for "John Paul Jones" search on "John Jones" and "Paul Jones" (standard pattern), for "Jacqueline Bouvier Kennedy" search on the alternate pattern with "Jacqueline Bouvier" and "Jacqueline Kennedy".
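The two subset-search patterns can be generated mechanically. The sketch below assumes names that split cleanly into exactly three space-separated parts; the function names are illustrative, and real names (particles, suffixes, initials) would need more care.

```python
from typing import List

def standard_pattern(name: str) -> List[str]:
    # "John Paul Jones" -> "John Jones", "Paul Jones"
    first, middle, last = name.split()
    return [f"{first} {last}", f"{middle} {last}"]

def alternate_pattern(name: str) -> List[str]:
    # "Jacqueline Bouvier Kennedy" -> "Jacqueline Bouvier", "Jacqueline Kennedy"
    first, middle, last = name.split()
    return [f"{first} {middle}", f"{first} {last}"]

def subset_searches(name: str) -> List[str]:
    # Full name first, then both patterns, without duplicates.
    if len(name.split()) != 3:
        return [name]
    terms = [name] + standard_pattern(name) + alternate_pattern(name)
    return list(dict.fromkeys(terms))

for term in subset_searches("John Paul Jones"):
    print(term)
```

Running both patterns for every three-part name costs extra searches, but avoids having to decide in advance whether the middle element is a given name or a family name.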

Amended with these patterns, and allowing for checking out "Frederick Smith, sculptor" as a possible painter (depending on period, also for architects etc. in related professions), a more realistic and practical model is obtained. It would for example miss "John P. Jones" if: none of "John Paul Jones", "John Jones" and "Paul Jones" were given as aliases, and if the search engine were not set up to deal with the infix initial P. On the scale of tens of thousands of checks, it is not practical to be always as diligent as for a single case.

It would also miss "Jackie Onassis". Further information may always imply more search terms. "Proving a negative" is effectively impossible, but the aim is to add statements to Wikidata with reasonable average-case effort, and avoid too many merges after item creation. The worst-case effort may be quite serious: skipping means such cases can be left to the end for appropriate amounts of attention.