User:Charles Matthews/YourPaintings report
This report covers the matching to Wikidata items of the artist dataset of Art UK (Q7257339), as found on their website. A complete pass has been done with Magnus Manske's mix'n'match tool, dividing artists into "matches", "not on Wikidata" and "N/A". See the sections below for more on the latter two categories, and for qualifications.
The pass finished 17 July 2015, with 8757 matches found (23.9%) from Wikidata to YourPaintings artist pages. That total includes a number of items created specially. "Not on Wikidata" came out at 35.9%.
The project here really got under way in March 2015, when the data began to be "cleaned up". The details page shows that about 30 editors contributed.
Technical issues and N/A
The YourPaintings dataset results from the amalgamation of over 2500 GLAM institutions' data; it is much more miscellaneous than the typical catalog used in mix'n'match. Its unusual provenance is reflected in the high proportion (just over 40%) of items marked "N/A". This was done mainly to remove artists without a date given, who could therefore not securely be matched to any item. (This flagging need not be permanent, since YP data as posted on the website is not necessarily the last word.)
- BBC YourPaintings website
- Mix'n'match tool page, requires WiDar authorisation from a Wikimedia account (see https://tools.wmflabs.org/widar/ for an easy access)
- Mix'n'match manual page, editable
On Wikipedia, matching work started in December 2012 at
There are chronological listings at
and an alphabetical set of pages
That work has yet to be reconciled in detail with the work here; but it should be noted that the matching personnel on Wikipedia and here has overlapped very significantly.
Of the 8757 items matched in Wikidata to the PCF artists, 6863 have links to the English Wikipedia.
For updates, use code of type CLAIM AND CLAIM etc. in Wikidata query. Code of type CLAIM AND NOCLAIM returns a list of items that have a YP identifier but are missing the second property.
To find the item for a given YourPaintings identifier such as william-blake (from http://www.bbc.co.uk/arts/yourpaintings/) use the code STRING[1367:william-blake] (returns a single hit in Autolist).
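The query forms mentioned above can be assembled mechanically. The following is a minimal sketch only: it builds the query strings and does not call any service. The property number 1367 is taken from the STRING example above; treating it as the first property in the CLAIM AND CLAIM pattern, and using VIAF ID (P214) from the table as the second property, are assumptions made for illustration. The helper names are hypothetical.

```python
# Sketch: build query strings of the kinds described in the text.
# Assumption: 1367 (from the STRING[1367:...] example) is the YP identifier
# property, and the second property is chosen by the caller.
YP_PROPERTY = 1367


def claim(prop: int) -> str:
    """Items that have a statement for the given property."""
    return f"CLAIM[{prop}]"


def noclaim(prop: int) -> str:
    """Items that lack a statement for the given property."""
    return f"NOCLAIM[{prop}]"


def all_of(*terms: str) -> str:
    """Join query terms with AND."""
    return " AND ".join(terms)


def has_both(second_prop: int) -> str:
    """Items with a YP identifier and the second property."""
    return all_of(claim(YP_PROPERTY), claim(second_prop))


def missing_second(second_prop: int) -> str:
    """Items with a YP identifier but without the second property."""
    return all_of(claim(YP_PROPERTY), noclaim(second_prop))


def lookup(slug: str) -> str:
    """Query for the item carrying a given YP identifier, in the form used in the text."""
    return f"STRING[{YP_PROPERTY}:{slug}]"


if __name__ == "__main__":
    print(has_both(214))
    print(missing_second(214))
    print(lookup("william-blake"))
```

Running the same pair of queries for each property in the table gives the monthly update counts and the corresponding lists of items still to be checked.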
|Property||Matches July 2015||Matches August 2015|
|Commons category (P373)||5183||5180|
|Commons Creator page (P1472)||3290||3297|
|Oxford Dictionary of National Biography ID (P1415)||1974||1968|
|ULAN ID (P245)||7323||7720|
|RKDartists ID (P650)||6994||7018|
|VIAF ID (P214)||7518||8022|
|GND ID (P227)||3535||3810|
|Library of Congress authority ID (P244)||3221||3656|
|National Portrait Gallery (London) person ID (P1816)||1753||1767|
NB (July 2015): Matches with authority control identifiers have not at this point been as thoroughly sought as they might have been. Updates should see further matching. Clearly another caveat is also in order: all matches here are provisional.
|Type of link||Matches August 2015|
|To deWP and enWP||2465|
|To deWP and not enWP||286|
|To frWP and enWP||2970|
|To frWP and not enWP||318|
|To nlWP and enWP||1778|
|To nlWP and not enWP||93|
|To itWP and enWP||2132|
|To itWP and not enWP||90|
- CLAIM AND LINK[frwiki] https://tools.wmflabs.org/pagepile/api.php?id=260&action=get_data&format=html
- CLAIM AND LINK[frwiki] AND LINK[enwiki] https://tools.wmflabs.org/pagepile/api.php?id=261&action=get_data&format=html
i.e. the difference between these two lists is as good as CLAIM AND LINK[frwiki] AND NOLINK[enwiki]
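Taking the first list minus the second is a plain set difference. A minimal sketch, assuming each list has already been fetched and parsed into item identifiers of the form "Q" followed by digits; the identifiers in the usage example are placeholders, not real results, and the function name is hypothetical.

```python
# Sketch: items linked to frwiki but not enwiki, derived from two lists.
# Assumption: both inputs are iterables of item IDs such as "Q123".
from typing import Iterable, List


def only_in_first(with_fr: Iterable[str], with_fr_and_en: Iterable[str]) -> List[str]:
    """Return IDs present in the first list but absent from the second,
    sorted by the numeric part of the ID."""
    remaining = set(with_fr) - set(with_fr_and_en)
    return sorted(remaining, key=lambda qid: int(qid[1:]))


if __name__ == "__main__":
    # Placeholder IDs for illustration only.
    fr_linked = ["Q3", "Q1", "Q2", "Q10"]
    fr_and_en_linked = ["Q2", "Q10"]
    print(only_in_first(fr_linked, fr_and_en_linked))
```

The same approach applies to the de, nl and it rows of the table.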
Issues involved in a full mix'n'match pass can be illustrated by an idealised model:
- Assume all names are in given name + family name form, e.g. "Jane Doe". Assume also that all (English) labels and descriptions are present in the items coming up in the site search here; and that for painters, for example, "painter" or a synonym is present in the description.
Then the issue is simply of deciding whether the person is the right person to match. The primary tool for disambiguation is to look at dates. For near misses or in case of doubt (e.g. a rather common name) further biographical information should be sought.
Incorrect matches, i.e. false positives, are more of a problem than false negatives, because the incorrect information in Wikidata may then not be picked up quickly. False negatives represent missed opportunities and, if items are systematically created after the pass, will require merges. Overall, though, false negatives are not the end of the world, and matching should be conservative for that reason.
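The conservative use of dates described above might be sketched as follows. This is an illustration, not the procedure actually followed in the pass: the one-year tolerance for a "near miss", the outcome labels and the function name are all assumptions. Records with no dates at all are returned as "n/a", in line with the flagging described earlier.

```python
# Sketch: conservative date comparison between a source record and an item.
# Assumptions: dates are reduced to (birth_year, death_year), either of which
# may be None; a difference of up to `tolerance` years counts as a near miss.
from typing import Optional, Tuple

Years = Tuple[Optional[int], Optional[int]]


def compare_years(source: Years, item: Years, tolerance: int = 1) -> str:
    """Return 'match', 'doubt', 'no match' or 'n/a'."""
    if all(year is None for year in source):
        return "n/a"  # no date given: cannot be matched securely
    diffs = [
        abs(s - i)
        for s, i in zip(source, item)
        if s is not None and i is not None
    ]
    if not diffs:
        return "doubt"  # nothing comparable: seek further information
    if max(diffs) > tolerance:
        return "no match"
    if max(diffs) > 0:
        return "doubt"  # near miss: seek further information
    return "match"


if __name__ == "__main__":
    print(compare_years((1757, 1827), (1757, 1827)))
    print(compare_years((1757, 1827), (1757, 1828)))
```

Anything other than "match" is left unmatched or skipped for later attention, which errs towards false negatives rather than false positives.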
There are clearly extensions required to the model.
- Names with three parts require "subset search": for "John Paul Jones" search on "John Jones" and "Paul Jones" (standard pattern), for "Jacqueline Bouvier Kennedy" search on the alternate pattern with "Jacqueline Bouvier" and "Jacqueline Kennedy".
Amended with these patterns, and allowing for checking out "Frederick Smith, sculptor" as a possible painter (depending on period, also for architects etc. in related professions), a more realistic and practical model is obtained. It would for example miss "John P. Jones" if: none of "John Paul Jones", "John Jones" and "Paul Jones" were given as aliases, and if the search engine were not set up to deal with the infix initial P. On the scale of tens of thousands of checks, it is not practical to be always as diligent as for a single case.
It would also miss "Jackie Onassis". Further information may always imply more search terms. "Proving a negative" is effectively impossible, but the aim is to add statements to Wikidata with reasonable average-case effort, and avoid too many merges after item creation. The worst-case effort may be quite serious: skipping means such cases can be left to the end for appropriate amounts of attention.
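The "subset search" patterns for three-part names described above can be generated mechanically. A minimal sketch, assuming names are split on whitespace; the function name and the returned structure are hypothetical.

```python
# Sketch: search terms for a three-part name, following the two patterns
# in the text. Names with any other number of parts are returned unchanged.
from typing import Dict, List


def subset_searches(name: str) -> Dict[str, List[str]]:
    """Return the 'standard' and 'alternate' two-part search terms."""
    parts = name.split()
    if len(parts) != 3:
        return {"standard": [name], "alternate": [name]}
    first, middle, last = parts
    return {
        # e.g. "John Paul Jones" -> "John Jones", "Paul Jones"
        "standard": [f"{first} {last}", f"{middle} {last}"],
        # e.g. "Jacqueline Bouvier Kennedy" -> "Jacqueline Bouvier", "Jacqueline Kennedy"
        "alternate": [f"{first} {middle}", f"{first} {last}"],
    }


if __name__ == "__main__":
    print(subset_searches("John Paul Jones")["standard"])
    print(subset_searches("Jacqueline Bouvier Kennedy")["alternate"])
```

As noted above, terms generated this way would still miss forms such as "John P. Jones" or "Jackie Onassis" unless those appear as aliases or the search engine handles them.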