Wikidata:WikiProject Structured Data for Commons/Data transfer
Data matching / extraction / comparison
- 2393 unmatched creators, 920 unmatched institutions -- 16:19, 5 December 2014 (UTC)
Previous work on matching Creator templates to wikidata entries
User:Jheald asked me to explain some of the procedures used by c:user:JarektBot to match Creator templates with wikidata entries, which I always saw as serving 3 purposes: (1) disambiguation, (2) providing more info about the person, and (3) being the first step in moving most of the content of individual Creator templates to wikidata entries and rewriting c:template:Creator to use wikidata.
Matching Creator templates to wikidata entries was really one of many steps in a long series of codes and procedures aimed at improving and synchronizing metadata. Some of the codes were written in python using existing pywikibot libraries; others were done using AutoWikiBrowser (AWB) in a 3-phase procedure: (1) harvest data using AWB custom modules like this; (2) assemble and process it in a spreadsheet; and (3) use AWB to copy extra fields to the page using this (horrible) module, AWB CSVLoader and, eventually, more reliable python codes.
The workflow was as follows:
- Find people categories which are not in c:category:people by name. This was done by using CatScan2 to create lists of categories that had date of birth/death categories, or were subcategories of c:category:Artists or c:Category:Politicians, and that were not in c:category:people by name but sounded like names. Those lists were verified by hand.
- Find subcategories of c:category:people by name
- Run c:User:JarektBot/Commons People categories.py which was
- matching commons categories to wikipedia articles using name match + years of birth/death
- adding interwiki links, date of birth/death category, default sort, etc. to commons categories
- Run modified interwiki.py to update interwiki links in all Commons categories
- Run Creator template maintenance.py, which copies many fields from wikipedia or the category namespace, such as authority control data, dates of birth/death, wikidata link, and names and wikipedia links for different languages. In the past the code also relied on the http://creatorlinks.wmflabs.org database, which was a temporary workaround while wikidata was not yet fully operational.
- Another approach I tried in the past was to compile lists of all the people categories on commons and articles in a few major wikipedias and use AWB to harvest dates of birth and death. Then I would run a code that generated pairs of names (one on commons and one on wikipedia) with the same pair of dates, and manually check whether the names were just alternative spellings of the same name. I found quite a few matches that way and used them to add interwiki links and other info.
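The date-pairing step in the last item can be sketched roughly as follows. This is a minimal illustration, not the original JarektBot code: the function name `candidate_pairs`, the tuple layout, and the sample records are all assumptions made for the example. It groups Wikipedia articles by their (birth year, death year) pair and then lists every Commons category that shares that pair, leaving the decision about whether the two names refer to the same person to a manual check.

```python
from collections import defaultdict


def candidate_pairs(commons, wikipedia):
    """Pair Commons categories with Wikipedia articles sharing both years.

    Both arguments are iterables of (name, birth_year, death_year) tuples.
    Records with a missing year are skipped, because a single date is too
    weak a key to suggest a match.
    """
    # Index Wikipedia articles by their (birth, death) year pair.
    by_dates = defaultdict(list)
    for title, born, died in wikipedia:
        if born is not None and died is not None:
            by_dates[(born, died)].append(title)

    # Every Commons category with the same year pair is a candidate only;
    # unrelated people can share both years, so the output needs review.
    pairs = []
    for category, born, died in commons:
        if born is None or died is None:
            continue
        for title in by_dates.get((born, died), []):
            pairs.append((category, title, born, died))
    return pairs


# Illustrative sample records (hypothetical harvest output).
commons_sample = [
    ("Jan Vermeer", 1632, 1675),
    ("Rembrandt", 1606, 1669),
    ("Unknown painter", 1700, None),
]
wikipedia_sample = [
    ("Johannes Vermeer", 1632, 1675),
    ("Rembrandt van Rijn", 1606, 1669),
    ("Baruch Spinoza", 1632, 1677),
]

for pair in candidate_pairs(commons_sample, wikipedia_sample):
    print(pair)
```

Indexing one side by the date pair avoids comparing every category against every article, which matters when the lists cover all people categories on Commons.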
The above codes have not been run for the last year or two and would need to be verified and possibly modified. I was also planning to add a part that synchronizes the information with wikidata as well, but I never had time to do it. That last step would also require getting a bot flag on wikidata. Anybody is welcome to cannibalize my codes for any purpose if they are useful. I hope that helps clarify how things were done so far, and I would be happy to go into greater detail if necessary. --Jarekt (talk) 20:35, 19 August 2014 (UTC)
- Note -- as of August 2014, c:User:Zhuyifei1999's YiFeiBot will also be taking up this task -- see c:Commons:Bots/Work requests and c:Commons:Bots/Requests/YiFeiBot (20) Jheald (talk) 12:34, 21 August 2014 (UTC)
- User:Multichill has also been doing some work on this:  Jheald (talk) 14:36, 6 September 2014 (UTC)
Matching Creator templates to wikidata entries as of 2017
- fewer than 1500 left in c:Category:Creator templates without Wikidata link
- Two mix-n-match projects started:
- https://tools.wmflabs.org/mix-n-match/#/catalog/471 162 templates - 100% done
- https://tools.wmflabs.org/mix-n-match/#/catalog/510 829 templates - 50% done
In the case of the second job, most matches are done and what is left is new item creation. If you create new items, please add the q-code to the creator page on Commons. --Jarekt (talk) 04:52, 3 September 2017 (UTC)
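For anyone scripting that last step, here is a rough sketch of adding a q-code to the wikitext of a creator page. It assumes the field is written as `| Wikidata = ...` inside the Creator template and that the template call is the last thing closed by `}}` on the page; the function name `set_wikidata_field` and the value `Q12345` are placeholders for illustration. It only transforms a string and does not fetch or save pages.

```python
import re

# A q-code is "Q" followed by a positive integer with no leading zero.
QID_RE = re.compile(r"Q[1-9]\d*")
# Assumed field layout: "| Wikidata = value" on a single line.
FIELD_RE = re.compile(r"(\|\s*Wikidata\s*=)[ \t]*([^\n|}]*)", re.IGNORECASE)


def set_wikidata_field(wikitext, qid):
    """Return wikitext with the Wikidata field set to qid.

    An existing non-empty value is left untouched so that a human can
    resolve any conflict by hand.
    """
    if not QID_RE.fullmatch(qid):
        raise ValueError(f"not a valid q-code: {qid!r}")

    match = FIELD_RE.search(wikitext)
    if match:
        if match.group(2).strip():
            return wikitext  # already has a value; do not overwrite
        # Fill in the empty field.
        return wikitext[:match.end(1)] + " " + qid + wikitext[match.end():]

    # No field present: add one just before the last closing braces.
    end = wikitext.rfind("}}")
    if end == -1:
        raise ValueError("no template closing braces found")
    prefix = wikitext[:end]
    if not prefix.endswith("\n"):
        prefix += "\n"
    return prefix + "| Wikidata = " + qid + "\n" + wikitext[end:]


page = "{{Creator\n| Name = Example Person\n| Wikidata =\n}}"
print(set_wikidata_field(page, "Q12345"))
```

Refusing to overwrite an existing value is a deliberate choice in this sketch: a mismatch between the page and a new match is better resolved by hand than by the script.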