Wikidata:WikiProject Authority control/Archive
Wikidata pays a lot of tribute to authority control, linking to all kinds of datasets and databases with various IDs. The holy grail of every GLAM worker, the Sum of All People with links to their Works, is coming about!
But we’re just at the start of a lot of work in that direction. The purpose of this project is to try and coordinate such work.
I know a few things (ViafBot, Mix-n-Match) and I'd like to help with some things, but I don't know what others are doing. --Vladimir Alexiev (talk) 08:08, 27 January 2015 (UTC)
Data Sources
The report Name Data Sources for Semantic Enrichment shows that when it comes to name data sources, maybe the two that matter are VIAF and Wikidata.
- Their name coverage is fairly orthogonal: VIAF has more name variations and permutations, Wikidata has more translations (Venn diagram of names for Cranach).
- VIAF is much bigger: 35M persons/orgs. Wikidata has 2.7M persons and maybe 1M orgs.
- Only 0.5M of Wikidata persons/orgs are coreferenced to VIAF, with maybe another 0.5M coreferenced to other datasets, either VIAF-constituent (eg GND) or non-constituent (eg RKDartists). So the coreferenced part of Wikidata persons/orgs is still quite small (about 30%) and a lot of work remains!
- A lot can be gained by leveraging coreferencing across VIAF and Wikidata: finding errors in Authority files, finding merge candidates in Wikidata, promulgating identifiers...
- Wikidata has great tools for crowd-sourced coreferencing.
Please comment!
RKDArtists Coreferencing
RKDartists is an important Authority that does not yet participate in VIAF. There are already 21760 RKDartists IDs on Wikidata. These could be imported to VIAF for free!
British Museum Coreferencing
The BM has several thesauri that are not co-referenced to anything in the world. I think they'd see it as a major win if the community helps them to co-reference.
- British Museum person or institution ID (P1711): 176461 persons, 21511 matched
- Wikidata:Property_proposal/British_Museum_place: 45883 places
- Wikidata:Property_proposal/British_Museum_thesauri: 28 more thesauri with 26804 entries
This could be followed by importing the 2.5M cultural objects of the BM.
ULAN Coref Relations
ULAN does record possible matches and mismatches in their editorial system: ULAN Artists Whose Identity May be Associated or Confused With Another (608 pairs).
Looks like this:
x | x_name | x_bio | rel | y | y_name | y_bio |
---|---|---|---|---|---|---|
ulan:500071106 | Master of 1515 | Portuguese painter, active 1515 | gvp:ulan1005_possibly_identified_with | ulan:500025279 | Afonso, Jorge | Portuguese painter and court artist, born ca. 1470-1475, died before 1540 |
ulan:500042027 | Master of the Madre de Deus Retablo | Portuguese painter, active 16th century | gvp:ulan1005_possibly_identified_with | ulan:500025279 | Afonso, Jorge | Portuguese painter and court artist, born ca. 1470-1475, died before 1540 |
ulan:500032055 | Monogrammist A. M. | Spanish artist, active 19th century | gvp:ulan1005_possibly_identified_with | ulan:500038287 | Aguirre, Marcial | Spanish sculptor, 1841-1900 |
Here's to proper coreferencing! --Vladimir Alexiev (talk) 18:07, 12 March 2015 (UTC)
Match Persons not Disambiguation Pages
We should match persons to persons, not disambiguation pages to persons or other disambiguation pages.
Wikipedias, GND and RKD all have disambiguation pages (in GND they are called "undifferentiated names"). 13 Feb 2015:
- Wrote to Jane
- Wrote to Magnus: Filter out Disambiguation entries and Un-notable Persons
Do you agree with my reasoning:
- Jane said "any match is better than none"
- I countered "A correct match is better than none"
- the only way to make sure it's correct is to examine more data about the person, which will necessarily lead you to a real person page.
- Look at the ULAN data above: that's good data that gives you some basis for decision. A name alone does not.
--Vladimir Alexiev (talk) 19:34, 12 March 2015 (UTC)
- Support @Vladimir Alexiev, Randykitty, Ghuron: A GND Tn (Thesaurus name = undifferentiated) is not a stable disambiguation page. A Tn is a placeholder. It can be deleted, it can be upgraded into a Tp (Thesaurus person), or changed into a redirect. Works connected with a Tn will be checked by the library or archive who owns them and afterwards might be delinked. The database Online GND (OGND) includes only Tp numbers. --Kolja21 (talk) 00:04, 28 March 2015 (UTC)
Coreference AAT
AAT is a crucial thesaurus in cultural heritage.
- It has 40k concepts, see http://vocab.getty.edu/sparql
select (count(*) as ?c) { ?x a skos:Concept; skos:inScheme aat: }
- Of them only 363 AAT are coreferenced, or under 1%, see http://tools.wmflabs.org/autolist/autolist1.html?q=CLAIM%5B1014%5D
I think that's BAD. I'm sure going to need that coref for the Europeana Food and Drink Classification Scheme that will be based on Wikidata and AAT.
Update: the AAT-Wordnet coreference described below is brought into Wikidata. AAT is actively coreferenced on Mix-n-Match: 12985 (32%) matched, 3293 (8%) awaiting confirmation, 1553 (3.8%) confirmed no-matches, and 22543 (55.7%) awaiting matching. So it's way better than 2 years ago. Help coreference this pivot thesaurus that is of immense importance for Cultural Heritage! --Vladimir Alexiev (talk) 15:04, 19 September 2017 (UTC)
Coreference AAT with Mix-n-Match
The Wikidata coref tool Mix-n-Match has mostly been used for people until now. But I hope it can be used for concepts as well.
I made an export that includes AAT URL, preferred English label (without qualifier), parents (ascendants to root) and scope note (description). Could also add alternative labels, and labels in other languages (Dutch, Spanish, Chinese).
select ?id (str(?lab) as ?label) ?parents (str(?scopeNote) as ?note) {
  ?x a gvp:Concept;
    dc:identifier ?id;
    gvp:prefLabelGVP/gvp:term ?lab;
    gvp:parentString ?parents.
  optional {?x skos:scopeNote [dct:language gvp_lang:en; rdf:value ?scopeNote]}
}
I saved as XML then converted to TDV: aat.rar.
rset --results tsv aat.xml > aat.tdv
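For readers without that command at hand, the XML-to-tab-delimited step can be sketched in Python. This is only a minimal sketch, assuming the saved file follows the W3C SPARQL Query Results XML Format; the function name `sparql_xml_to_tsv` and the inline sample are illustrative, and the output layout is not claimed to be identical to what `rset` writes.

```python
import xml.etree.ElementTree as ET

# Namespace of the W3C SPARQL Query Results XML Format
SR = "{http://www.w3.org/2005/sparql-results#}"

def sparql_xml_to_tsv(xml_text):
    """Convert SPARQL XML results to tab-delimited text (header row first)."""
    root = ET.fromstring(xml_text)
    variables = [v.get("name") for v in root.iter(SR + "variable")]
    lines = ["\t".join(variables)]
    for result in root.iter(SR + "result"):
        row = dict.fromkeys(variables, "")  # unbound variables stay empty
        for binding in result.findall(SR + "binding"):
            value = binding[0]  # the <uri>, <literal> or <bnode> child
            text = value.text or ""
            # Keep one record per line: flatten tabs and newlines in values
            row[binding.get("name")] = text.replace("\t", " ").replace("\n", " ")
        lines.append("\t".join(row[v] for v in variables))
    return "\n".join(lines)

# Illustrative sample: one AAT concept with no scope note bound
sample = """<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head><variable name="id"/><variable name="label"/><variable name="note"/></head>
  <results>
    <result>
      <binding name="id"><literal>300011012</literal></binding>
      <binding name="label"><literal>wrought iron</literal></binding>
    </result>
  </results>
</sparql>"""
print(sparql_xml_to_tsv(sample))
```

The optional scope note is left as an empty column when it is unbound, which keeps every row the same width for Mix-n-Match style imports.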
Also see https://meta.wikimedia.org/wiki/Talk:Mix%27n%27match#Coreference_AAT !!!
Coreference AAT through BabelNet
Mix-n-Match has good automatic matching, but that works for people.
So let's check which other vocabs that are coref to AAT may also be coref to Wikidata. According to Michiel Hildebrand's famous CH LOD diagram:
- Wordnet. No such prop in Wikidata
- I'd guess Wiktionary is coref to Wordnet, but Wikidata has no site links to Wiktionary
- RKD Concepts. There's prop "RKDartists" and "RKDimages" but none for concepts
- Rijksmuseum Concepts. There's "Rijksmonument" but none for concepts
- Joconde: aha! There's Joconde work ID (P347), and it has 2275 instances, so that's better. Joconde is 18% coref to Wikidata but I don't know how much to AAT, maybe I can gain 1k here.
- Looked at the results: nope, Joconde are all paintings, not concepts
- Bibliopolis: never heard of it, and nope
- SVCN: never heard of it, and nope
Then it dawns on me.
- BabelNet corefs Wordnet and Wikipedia! In fact, it corefs to Wikidata, see http://babelnet.org/stats. (Very useful stat: http://babelnet.org/stats#Numberofpolysemousandmonosemouswords)
- Some months ago they didn't have RDF access. But they do now: http://babelnet.org/download (not download, but API: good enough)
- On the web view of chipotle in Sources I see Wikidata
- On the RDF view http://babelnet.org/rdf/page/s00018522n I don't see Wikidata but I see DBpedia, so that's ok
AAT-Wordnet coref
Ok, so off to look for that AAT-Wordnet coref.
- Why yes, it's part of http://semanticweb.cs.vu.nl/europeana/skos/browse/
- I got a file from somewhere that says
<aat_wordnet20_mappings> a void:Linkset;
  dcterms:title "AAT-Wordnet 2.0 mappings by Anna Tordai (baseline)" ;
  lib:source <http://semanticweb.cs.vu.nl/lod/getty/aat/> ;
  void:dataDump <bl_aat_wn.rdf> , <bl_norm_aat_wn.rdf> , <bl_sing_aat_wn.rdf> .
(Note: you can get those files from URLs like: http://semanticweb.cs.vu.nl/europeana/api/export_graph?graph=http://semanticweb.cs.vu.nl/lod/getty/aat/bl_sing_aat_wn.rdf&mimetype=text/plain&format=turtle)
These are called "baseline" (i.e. mostly literal matches). A quick conversion to Turtle and a line count:
$ wc -l bl*
 2300 bl_aat_wn.ttl
 4369 bl_norm_aat_wn.ttl
 4303 bl_sing_aat_wn.ttl
10972 total
Run a query at http://semanticweb.cs.vu.nl/europeana/user/query (specify entailment=None or else!):
prefix getty: <http://purl.org/vocabularies/getty/>
prefix aat: <http://purl.org/vocabularies/getty/aat/>
select * {?x skos:inScheme getty:aat; skos:closeMatch ?y}
It returns 4592 (see below why).
AAT-Wordnet Overlaps
There is significant overlap between the files:
$ cat bl* | sort| uniq | wc -l
4596
$ cat bl* |cut -d " " -f 1 | sort| uniq | wc -l
4581
The following AAT concepts have 2 matches:
aat:bleachers aat:boxcars aat:cleavers aat:feudalism aat:groats aat:jackstraws aat:lats aat:leotards aat:morocco aat:ninepins aat:quoits aat:shekels aat:stairs
We need to reconcile them manually, eg
aat:bleachers skos:closeMatch <http://www.w3.org/2006/03/wn/wn20/instances/synset-bleacher-noun-1> .
aat:bleachers skos:closeMatch <http://www.w3.org/2006/03/wn/wn20/instances/synset-bleachers-noun-1> .
The AAT definition is:
- aat:bleachers vp:descriptiveNote "Use for benchlike tiered seating for spectators at, for example, outdoor sporting events, usually without weather or sun protection, affording less advantageous views than grandstands; may also be used for similarly constructed, often telescoping, indoor seating."@en .
- Inspection at Wordnet 3.1 shows that the second one is right.
That's 4.6k matches, or 11% of AAT.
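The overlap check can also be scripted so that the concepts needing manual reconciliation come out as a list. The following is a minimal sketch, assuming one space-separated triple per line (the same assumption the `cut -d " " -f 1` pipeline makes); `multi_matches` is an illustrative name, and the sample reuses the bleachers and wrought_iron triples quoted on this page.

```python
from collections import defaultdict

def multi_matches(lines):
    """Return {subject: sorted objects} for subjects with more than one distinct match."""
    matches = defaultdict(set)
    for line in set(lines):  # dropping exact duplicates, like `sort | uniq`
        parts = line.split()
        if len(parts) >= 3 and parts[1] == "skos:closeMatch":
            matches[parts[0]].add(parts[2])
    return {s: sorted(o) for s, o in matches.items() if len(o) > 1}

WN20 = "http://www.w3.org/2006/03/wn/wn20/instances/"
sample = [
    f"aat:bleachers skos:closeMatch <{WN20}synset-bleacher-noun-1> .",
    f"aat:bleachers skos:closeMatch <{WN20}synset-bleachers-noun-1> .",
    f"aat:wrought_iron skos:closeMatch <{WN20}synset-wrought_iron-noun-1> .",
    # the same triple appearing in a second file
    f"aat:wrought_iron skos:closeMatch <{WN20}synset-wrought_iron-noun-1> .",
]
for subject, objects in multi_matches(sample).items():
    print(subject, objects)
```

Run over the concatenated bl* files, this would list each ambiguous AAT concept together with its candidate synsets, ready for checking against the AAT definition.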
AAT-Wordnet2 Representation
The coref looks like this:
aat:wrought_iron skos:closeMatch <http://www.w3.org/2006/03/wn/wn20/instances/synset-wrought_iron-noun-1> .
And there's another file aat.ttl with a representation like:
aat:wrought_iron aat:parentPreferred aat:iron_alloy .
aat:wrought_iron vp:id "300011012" .
aat:wrought_iron vp:labelPreferred "wrought iron"@en .
aat:wrought_iron vp:labelNonPreferred "iron, wrought"@en .
aat:wrought_iron vp:labelNonPreferred "wrought-iron"@en .
This is quite an old representation. The new one uses numeric URLs such as http://vocab.getty.edu/aat/300011012 (and a bunch more data). So we need to construct a numeric URL from the vp:id value.
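Constructing the numeric URL amounts to a lookup from the old name-based concept to its vp:id. Here is a minimal sketch, assuming both files hold one triple per line in the form shown above; the regular expressions and the name `numeric_matches` are illustrative, not part of any existing tool.

```python
import re

ID_LINE = re.compile(r'^(aat:\S+)\s+vp:id\s+"(\d+)"')
MATCH_LINE = re.compile(r'^(aat:\S+)\s+skos:closeMatch\s+<([^>]+)>')
NEW_BASE = "http://vocab.getty.edu/aat/"

def numeric_matches(aat_lines, match_lines):
    """Rewrite name-based AAT matches as (numeric AAT URL, Wordnet URL) pairs."""
    ids = {}
    for line in aat_lines:
        m = ID_LINE.match(line)
        if m:
            ids[m.group(1)] = m.group(2)
    pairs = []
    for line in match_lines:
        m = MATCH_LINE.match(line)
        if m and m.group(1) in ids:  # concepts without a vp:id are skipped
            pairs.append((NEW_BASE + ids[m.group(1)], m.group(2)))
    return pairs

aat_lines = [
    'aat:wrought_iron vp:id "300011012" .',
    'aat:wrought_iron vp:labelPreferred "wrought iron"@en .',
]
match_lines = [
    "aat:wrought_iron skos:closeMatch "
    "<http://www.w3.org/2006/03/wn/wn20/instances/synset-wrought_iron-noun-1> .",
]
print(numeric_matches(aat_lines, match_lines))
```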
AATNED-Cornetto Mapping
Cornetto is the Dutch (NL) Wordnet and AATNED is the Dutch AAT. I got another file saying:
<aatned_cornetto_mappings> a void:Linkset ;
  dcterms:title "AATNED-Cornetto mappings by Anna Tordai (baseline)";
  lib:source <http://semanticweb.cs.vu.nl/lod/rkd/aatned/> ;
  void:dataDump <bl_aatned_cn.rdf.gz> , <bl_norm_aatned_cn.rdf.gz> , <bl_sing_aatned_cn.rdf.gz> .
Eg we have this for AAT 300191645 "salinity":
bl_aatned_cn.ttl: aatned:zoutheid skos:closeMatch cornetto:synset-zoutheid-1-noun .
cornetto-wn20.ttl: cornetto:synset-zoutheid-1-noun cornetto:eqNearSynonym instances:synset-brininess-noun-1 .
cornetto-wn30.ttl: cornetto:synset-zoutheid-1-noun cornetto:eqNearSynonym wn30:synset-brininess-noun-1 .
aatned.ttl: aatned:zoutheid core:notation "300191645" .
The number of AATNED-Cornetto matches is as follows:
> cat bl*|sort|uniq> bl_aatned_all.ttl
> wc -l bl_aatned_all.ttl
6917 bl_aatned_all.ttl
> cat bl_aatned_all.ttl|cut -d " " -f 1 | sort| uniq | wc -l
6857
There are more matches than AAT-Wordnet. There are also overlaps: 60 AATNED concepts (0.9%) have two Cornetto matches.
We need to merge AATNED-Cornetto with AAT-Wordnet. The correlation is simply by id, eg
aatned.ttl: aatned:zwerfkeien core:notation "300011671"
aat.ttl: aat:boulder vp:id "300011671"
I guess the overlaps between them are quite big, eg for wrought_iron:
aatned.ttl: aatned:smeedijzer core:notation "300011012"
bl_aatned_all.ttl: aatned:smeedijzer skos:closeMatch cornetto:synset-smeedijzer-1-noun .
cornetto-wn20.ttl:cornetto:synset-smeedijzer-1-noun cornetto:eqNearSynonym instances:synset-wrought_iron-noun-1 .
cornetto-wn30.ttl:cornetto:synset-smeedijzer-1-noun cornetto:eqNearSynonym wn30:synset-wrought_iron-noun-1 .
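Since the correlation between the two files is simply by numeric id, the merge is a join on that id. The snippet below is a minimal sketch under the assumption of one space-separated triple per line; `index_by_notation` is an illustrative helper, and the sample lines are the boulder and wrought_iron examples from this section.

```python
def index_by_notation(lines, predicate):
    """Map numeric id -> concept name, reading `subject predicate "id"` lines."""
    index = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 3 and parts[1] == predicate:
            index[parts[2].strip('"')] = parts[0]
    return index

aatned = index_by_notation(
    ['aatned:zwerfkeien core:notation "300011671"',
     'aatned:smeedijzer core:notation "300011012"'],
    "core:notation",
)
aat = index_by_notation(
    ['aat:boulder vp:id "300011671"',
     'aat:wrought_iron vp:id "300011012" .'],
    "vp:id",
)

# Join on the ids present in both files
pairs = {i: (aat[i], aatned[i]) for i in sorted(aat.keys() & aatned.keys())}
for concept_id, (en, nl) in pairs.items():
    print(concept_id, en, nl)
```

With both indexes in place, the AAT-Wordnet and AATNED-Cornetto matches of the same id can be compared side by side to measure the overlap guessed at above.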
DBpedia-Wordnet3 coref
The other problem is bigger:
- http://www.w3.org/2006/03/wn/wn20/instances/synset-wrought_iron-noun-1 is Wordnet 2.0 in 9-year old rep, with wn20schema:synsetId "113958999"
- while BabelNet: http://babelnet.org/rdf/s00081730n has
bn:s00081730n skos:exactMatch dbpedia:Wrought_iron, lemon-WordNet:wn30-14802262-n
- that uses a modern LEMON wordnet rep: http://lemon-model.net/lexica/pwn/wn30-14802262-n
- Note: Wordnet 3.1 has IDs like "14826432" and "wrought_iron%1:27:00::"
It doesn't look like Wordnet3 and Wordnet2 share any IDs; we'll deal with that in next section.
Let's first do some queries at http://babelnet.org/sparql/ to see what we can see. Look for DBpedia-Wordnet matches:
SELECT * WHERE {
  ?x skos:exactMatch ?y, ?z
  filter(strstarts(str(?y),"http://dbpedia.org/resource/"))
  filter(strstarts(str(?z),"http://lemon-model.net/lexica/pwn/"))
} LIMIT 30
Download: https://www.dropbox.com/s/92gq5r1qm3yytkp/WN3toDBP.csv?dl=1.
It has 47607 rows like this (there's a decent chance this will cover the 6k AAT matches):
"http://babelnet.org/rdf/s00075206n","http://dbpedia.org/resource/Sundowner_(drink)","http://lemon-model.net/lexica/pwn/wn30-07913081-n"
"http://babelnet.org/rdf/s00039711n","http://dbpedia.org/resource/Sonora_(genus)","http://lemon-model.net/lexica/pwn/wn30-01736256-n"
"http://babelnet.org/rdf/s00070026n","http://dbpedia.org/resource/Sealskin","http://lemon-model.net/lexica/pwn/wn30-04160261-n"
Wordnet3-Wordnet2 coref
Since Wordnet3 and Wordnet2 don't share any IDs, we can try to use the Wordnet2-Wordnet3 coref made by Jacco van Ossenbruggen and Mark van Assem (VU University Amsterdam) in May 2010 with this VOID (manifest):
<wn30-wn20-mappings-jacco> a void:Linkset ;
  dcterms:title "synset-level mappings from Wordnet 3.0 to 2.0, created by jacco's code" ;
  lib:source <http://purl.org/vocabularies/princeton/wn30/> ;
  void:dataDump <label-child-matches.ttl.gz> , <label-childparent-matches.ttl.gz> ,
    <label-instance-matches.ttl.gz> , <label-meronym-matches.ttl.gz>,
    <label-neargloss-matches.ttl.gz> , <label-parent-matches.ttl.gz> ,
    <label-unique-matches.ttl.gz> , <nearlabel-matches.ttl.gz> , <glossmatches-m.ttl.gz> .
<wn30-wn20-mappings-sense> a void:Linkset ;
  dcterms:title "synset-level mappings from Wordnet 3.0 to 2.0, created by Mark using the Princeton WordSense mappings" ;
  lib:source <http://purl.org/vocabularies/princeton/wn30/> ;
  void:dataDump <synset-matches-based-on-multiple-sense-mappings-princeton.ttl.gz> ,
    <synset-matches-based-on-single-sense-mappings-princeton.ttl.gz> .
It's a complex affair consisting of many steps, but the major step (contributing 87% of all matches) is glossmatches-m.ttl that looks like
wn30:synset-wrought_iron-noun-1 terms:replaces instances:synset-wrought_iron-noun-1 .
And looking at wordnet-synset.ttl, we find the required wn30 ID:
wn30:synset-wrought_iron-noun-1 wn20schema:synsetId 114802262 .
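The last hop of the chain (wn30 synsetId to LEMON URL to DBpedia) can be sketched as follows. This is a sketch under stated assumptions: the only pattern attested above is that 114802262 corresponds to wn30-14802262-n, so the code assumes a 9-digit id whose leading 1 marks a noun and refuses anything else; the CSV row is reconstructed from the bn:s00081730n example rather than copied from WN3toDBP.csv; `lemon_url` and `index_dbpedia` are illustrative names.

```python
import csv
import io

# Only the noun case is attested in the text above (leading digit 1 -> suffix "n");
# other parts of speech are deliberately left out rather than guessed.
POS_SUFFIX = {"1": "n"}
LEMON_BASE = "http://lemon-model.net/lexica/pwn/wn30-"

def lemon_url(synset_id):
    """Turn a wn20schema:synsetId such as 114802262 into a LEMON Wordnet 3.0 URL."""
    digits = str(synset_id)
    if len(digits) != 9 or digits[0] not in POS_SUFFIX:
        raise ValueError("unsupported synsetId: %r" % (synset_id,))
    return LEMON_BASE + digits[1:] + "-" + POS_SUFFIX[digits[0]]

def index_dbpedia(csv_text):
    """Map LEMON Wordnet URL -> list of DBpedia URLs from BabelNet CSV rows."""
    index = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) == 3:  # columns: babelnet, dbpedia, wordnet
            index.setdefault(row[2], []).append(row[1])
    return index

# Row reconstructed from the bn:s00081730n example; not copied from the CSV file.
rows = ('"http://babelnet.org/rdf/s00081730n",'
        '"http://dbpedia.org/resource/Wrought_iron",'
        '"http://lemon-model.net/lexica/pwn/wn30-14802262-n"\n')
dbpedia_by_wn30 = index_dbpedia(rows)
print(dbpedia_by_wn30.get(lemon_url(114802262)))
```

A list is kept per Wordnet URL because nothing above guarantees that each synset maps to a single DBpedia resource.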
AAT-Wikidata Sheets
After much querying and manual cleaning (over a day of effort), I made some sheets in this google folder:
- AAT-DBpedia-Babelnet.xlsx: 3324 potential matches, fairly clean, but need checking by more people
- AAT-DBpedia-Babelnet-80-judged.xlsx: example of correct & incorrect matches
- AAT-Wikidata-25-judged.xlsx: example of correct & incorrect matches on Mix-n-Match
Notified participants of WikiProject Authority control: I need your help!
Notified participants of WikiProject Visual arts: Yours too!
- Do some checks (add your initials in column "check")
- Add Q numbers to the sheet
- Merge WD items that already have Art & Architecture Thesaurus ID (P1014) (there are 8477) to the sheet to compare the matches (or remove them from the sheet if you're quite confident)
I could post the sheet as QuickStatements, but I think there are still 10% incorrect matches, especially for Styles and Periods (see Wikidata talk:WikiProject Visual arts/Item structure/Art movements). --Vladimir Alexiev (talk) 16:02, 7 March 2017 (UTC)
AAT-LCSH coreferencing
445 AAT-LCSH coreferences made by Getty editors.
400 of them are on the Getty LOD site (see query below), 45 are newly extracted
select * { ?x skos:exactMatch|skos:closeMatch ?y. ?x skos:inScheme aat: filter not exists {?y skos:inScheme aat:}}
Geonames Feature Code
Notified participants of WikiProject Authority control
Notified participants of WikiProject Companies
WikiProject Cultural heritage has more than 50 participants and couldn't be pinged. Please post on the WikiProject's talk page instead.
Notified participants of WikiProject Visual arts
GeoNames feature code (P2452) is applied only 33 times (see discussion and archive), while there are 669 codes on GeoNames.
I'll ask Magnus to add the Geonames list to Mix-n-Match.
Update Jul 2020: Things are much better at https://www.wikidata.org/wiki/Property:P2452:
- Fully matched 545 82.4%
- Preliminarily matched 7
- Unmatched 109 16.4%
- Total 661
- The actual total at http://www.geonames.org/export/codes.html is 680. --Vladimir Alexiev (talk) 12:31, 15 July 2020 (UTC)
http://www.geonames.org/ontology/mappings_v3.01.rdf has the following mappings:
32 dbo http://dbpedia.org/ontology/
5 frgeo http://rdf.insee.fr/geo/
79 lgdo http://linkedgeodata.org/ontology/
31 schema http://schema.org/
Can we use them somehow to push this coreferencing further?
- I'm a little wary about importing these wholesale, because GeoNames is not a CC0 database. It's one thing to be providing external links to GeoNames, it's another to be importing data.
- I did look at these values recently for English places with geonames links that are marked as both village (Q532) and civil parish (Q1115575) (see eg Abberton (Q3137539) for an example), to identify which Geonames link corresponded to which role; but I purposely decided not to add a GeoNames feature code (P2452) statement.
- It might be useful to be able to map the codes to Q-numbers here, to facilitate sanity checking of existing or proposed co-references. However, even then there are difficulties -- for example, I found that PPLA3 or PPLA4 at GeoNames didn't necessarily match the distinctions we would want to make in an instance of (P31) here. Jheald (talk) 14:02, 13 March 2017 (UTC)
Supplement Wikidata items with properties from authorities (GND in particular)
Data from DifferentiatedPersons of GND can be used to fill missing properties of the corresponding items, e.g.,
- date of birth/death (directly from gndo:dateOfBirth and gndo:dateOfDeath, for entries following YYYY or YYYY-MM-DD - everything else to be skipped)
- affiliation (P1416) can be obtained by a join of gndo:affiliation to wd organizations (may be sparse currently, but can be repeated later on)
- country (country (P17) or country of citizenship (P27)??) requires translation from gndo:geographicAreaCode, which refers to a customized code table derived from ISO 3166 (not part of GND) (table (pdf), rules)
- aliases - require filtering of gndo:variantNameForThePerson, which carry no language tag, re. script and presumed language (would Lingua::Identify work here?)
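The date rule in the first bullet (accept YYYY or YYYY-MM-DD, skip everything else) is easy to pin down in code. The following is a minimal sketch; the function name `gnd_date_to_wikidata` is hypothetical, and the output string is an assumption about the target format: a Wikidata-style time value with precision 9 for a year and 11 for a day.

```python
import re
from datetime import date

GND_DATE = re.compile(r"(\d{4})(?:-(\d{2})-(\d{2}))?")

def gnd_date_to_wikidata(value):
    """Return a time string with precision for YYYY or YYYY-MM-DD, else None (skip)."""
    m = GND_DATE.fullmatch(value.strip())
    if m is None:
        return None  # e.g. "ca. 1470", "1470-1475", "1841-03"
    year, month, day = m.groups()
    if int(year) == 0:
        return None  # placeholder year, nothing usable
    if month is None:
        return "+%s-00-00T00:00:00Z/9" % year  # precision 9: year
    try:
        date(int(year), int(month), int(day))  # reject impossible dates like 02-30
    except ValueError:
        return None
    return "+%s-%s-%sT00:00:00Z/11" % (year, month, day)  # precision 11: day

for raw in ["1472", "1841-03-05", "ca. 1470", "1841-02-30"]:
    print(raw, "->", gnd_date_to_wikidata(raw))
```

Returning None for everything outside the two accepted shapes keeps uncertain or ranged dates out of the import, in line with "everything else to be skipped".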
For appropriate source statements see project chat -- Jneubert (talk) 06:35, 21 May 2017 (UTC) (with thanks to User:MisterSynergy and User:ChristianKl)
- I am a big fan of standards but the ISO 3166 is used for modern countries, it does not as a consequence give the "nationality" of people who did precede a country. Thanks, GerardM (talk) 07:03, 21 May 2017 (UTC)
- NB yes there are some, but at Wikidata we know about many more former countries. GerardM (talk) 07:13, 21 May 2017 (UTC)
A "sibling" of Mix-n-Match now imports birth/death dates from authority files: https://www.wikidata.org/wiki/User:Magnus_Manske/Mix%27n%27match_date_import. Also see discussion about this in relation to Getty ULAN: https://groups.google.com/forum/#!topic/gettyvocablod/TkdelW9RP1g --Vladimir Alexiev (talk) 09:38, 2 October 2017 (UTC)
Property proposal for applying SKOS mapping relations to "external identifiers"
In order to be able to map a thesaurus more completely, and, more generally, to make Wikidata fit as a linking hub for knowledge organization systems, I've proposed a new property which allows qualifying individual links made by properties of type "external identifier" as inexact (close/broad/narrow/related) matches.
Please feel free to comment at https://www.wikidata.org/wiki/Wikidata:Property_proposal/mapping_relation_type.
Cheers, Jneubert (talk) 12:27, 28 August 2017 (UTC)
Grant proposal soweego
Notified participants of WikiProject Authority control
There is a new grant proposal soweego for authority control. See discussion at https://meta.wikimedia.org/wiki/Grants:Project/Hjfocs/soweego#Endorsements.
I've considered it seriously but I think it doesn't address the main problem, see a list of 11 considerations (which can also be read as a sort of programme for next important steps for WD authority control). Please express your opinion there. --Vladimir Alexiev (talk) 10:19, 2 October 2017 (UTC)
- Where is this discussion with 11 considerations? Thanks, GerardM (talk) 12:22, 2 October 2017 (UTC)
Remove Obsolete Getty Vocabularies IDs
Getty Vocabs have 8k obsolete subjects (identifiers). You can find them at http://vocab.getty.edu/sparql with a query like
select * { ?old a gvp:ObsoleteSubject; skos:inScheme aat: . optional {?old dct:isReplacedBy ?new} }
Here are the numbers (I can put up the files somewhere): obsolete-AAT-2106.tsv, obsolete-TGN-1016.tsv, obsolete-ULAN-5574.tsv.
Notified participants of WikiProject Authority control
Any takers to replace any old values of AAT ID Search, TGN ID Search, ULAN ID Search with the new values from the respective files?
@Magnus Manske: rdf:type gvp:ObsoleteSubject should be removed from Mix-n-Match consideration.
- Done, but only changed 5 entries for the ~2100 obsolete ones, the others were either already N/A or matched with an item. --Magnus Manske (talk) 15:22, 21 February 2018 (UTC)
- Data should not be removed, but - if necessary - marked as deprecated, with a qualifier giving the reason for deprecation. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 11:02, 26 February 2018 (UTC)
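A dry run over such a file could list which items carry an obsolete identifier and what the replacement is, leaving the actual edit (deprecating rather than removing, as suggested above) to a human or a bot. This is a minimal sketch with hypothetical data: the column layout of the TSV files (old id, tab, new id, no header row) is an assumption, and the item names and identifiers below are made up for illustration.

```python
def parse_replacements(tsv_text):
    """Read `old<TAB>new` lines; an empty or missing second column means no replacement."""
    replaced_by = {}
    for line in tsv_text.splitlines():
        if not line.strip():
            continue
        cols = line.split("\t")
        old = cols[0].strip()
        new = cols[1].strip() if len(cols) > 1 else ""
        replaced_by[old] = new or None
    return replaced_by

def plan_updates(item_ids, replaced_by):
    """Return (item, obsolete id, replacement or None) for items holding an obsolete id."""
    plan = []
    for item, current in sorted(item_ids.items()):
        if current in replaced_by:
            plan.append((item, current, replaced_by[current]))
    return plan

# Hypothetical data, not taken from the obsolete-*.tsv files
tsv = "300000001\t300000002\n300000003\t\n"
items = {"item-A": "300000001", "item-B": "300000003", "item-C": "300011012"}

for item, old, new in plan_updates(items, parse_replacements(tsv)):
    action = "deprecate, add %s" % new if new else "deprecate (no replacement)"
    print(item, old, action)
```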
VIAF Games
Notified participants of WikiProject Authority control
Notified participants of WikiProject Names
Notified participants of WikiProject Visual arts
WikiProject Cultural heritage has more than 50 participants and couldn't be pinged. Please post on the WikiProject's talk page instead.
Anyone interested in Authority Control, please read this section. It provides info about VIAF volumetrics, WD-VIAF volumetrics (very low!), correlation VIAF code - WD entity&prop, and an idea how to proliferate identifiers.
VIAF volumetrics
We at Ontotext are developing a VIAF reconciliation service, and here are some statistics about their dump 2018-07 (note: this dump is missing links, see below, so it may be missing entities too):
Type | Count | Comment |
---|---|---|
skos:Concept | 54,741,718 | Contributor (per-source) node, 1.7x per entity (cluster) |
foaf:Document & ont:InformationResource | 32,254,690 | Entities (clusters), or RDF docs describing each cluster |
schema:Person | 21,034,181 | persons |
schema:Organization | 4,359,408 | organizations. Pure orgs (not places): 4,249,192 |
schema:Place | 972,703 | places |
schema:Place & schema:Organization | 110,216 | Eg viaf:233665742 Chile: country and its government |
schema:CreativeWork | 4,110,185 | creative work (uniform title) |
bib:Agent | 2,163,020 | "placeholder URI in need of further matching" |
wd:Q1387388 | 2,148,663 | "chimera: mythological hybrid creature combining body parts of more than one real species. VIAF URI corrupted by an 'undifferentiate name' and should be treated as unusable" |
pto:Pseudonym | 4,639 | A person with a pseudonym |
Wikidata to VIAF Coreferencing
Number of WD entities with VIAF id, per class (2019-11).
- 9.8k entity types have VIAF ids, the top 500 are shown below
- WD has 6,215,688 humans, of which 1,571,460 are coreferenced. So about 5.2% (?) of VIAF persons are coreferenced, and 25.28% of WD persons are coreferenced. This is the same percentage as in 2018-04, but the absolute number has increased by about 50%
- WD has 2,339,103 organizations, of which 327,694 are coreferenced, so 5% (?) of VIAF orgs are coreferenced, and 14% of WD orgs are coreferenced. This is a significant increase over the 2018-04 percentage (8.4%), and the absolute number has increased by 110%
- I believe a similar percent of places are coreferenced: between 15 and 25%
--Vladimir Alexiev (talk) 08:23, 27 November 2019 (UTC)
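The shares quoted above can be re-derived in a few lines. This is only a sketch: the Wikidata counts are the ones given in this section, while the VIAF denominators are taken from the 2018-07 volumetrics table (schema:Person and schema:Organization), which is an assumption since the figures marked "(?)" may have been computed against different totals. With these assumed denominators the VIAF-side shares come out near 7.5% rather than about 5%.

```python
def share(part, whole):
    """Percentage of `whole` covered by `part`, rounded to two decimals."""
    return round(100 * part / whole, 2)

# Wikidata counts from this section (2019-11)
wd_humans, wd_humans_viaf = 6_215_688, 1_571_460
wd_orgs, wd_orgs_viaf = 2_339_103, 327_694

# Assumed VIAF denominators: 2018-07 dump table (schema:Person, schema:Organization)
viaf_persons, viaf_orgs = 21_034_181, 4_359_408

report = {
    "WD humans with VIAF id": share(wd_humans_viaf, wd_humans),
    "WD orgs with VIAF id": share(wd_orgs_viaf, wd_orgs),
    "VIAF persons linked from WD": share(wd_humans_viaf, viaf_persons),
    "VIAF orgs linked from WD": share(wd_orgs_viaf, viaf_orgs),
}
for label, pct in report.items():
    print(f"{label}: {pct}%")
```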
Counts:
- human (Q5): 1571460
- musical group (Q215380): 27559
- business (Q4830453): 23100
- organization (Q43229): 14913
- human settlement (Q486972): 14567
- commune of France (Q484170): 12508
- Ortsteil (Q253019): 11467
- municipality in Germany (Q262166): 9886
- museum (Q33506): 8905
- church (Q16970): 8575
- university (Q3918): 7101
- village (Q532): 6936
- city in the United States (Q1093829): 6510
- municipality seat (Q15303838): 6383
- river (Q4022): 5805
- commune of Italy (Q747074): 5687
- city (Q515): 4899
- enterprise (Q6881511): 4571
- town (Q3957): 4417
- municipality of Spain (Q2074737): 4298
- government agency (Q327333): 4129
- publishing company (Q2085381): 4087
- mountain (Q8502): 3788
- nonprofit organization (Q163740): 3751
- political party (Q7278): 3560
- art museum (Q207694): 3358
- hospital (Q16917): 3198
- high school (Q9826): 3194
- research institute (Q31855): 3148
- unincorporated community in the United States (Q17343829): 3024
- town in the United States (Q15127012): 3023
- school (Q3914): 2866
- island (Q23442): 2579
- village of Poland (Q3558970): 2545
- voluntary association (Q48204): 2538
- municipality of Switzerland (Q70208): 2519
- census-designated place in the United States (Q498162): 2419
- theatre building (Q24354): 2364
- building (Q41176): 2302
- open-access publisher (Q45400320): 2272
- civil parish (Q1115575): 2107
- film (Q11424): 2098
- dissolved municipality of Japan (Q18663566): 2080
- association football club (Q476028): 2024
- municipality of Austria (Q667509): 2010
- archaeological site (Q839954): 1967
- library (Q7075): 1962
- second-level administrative division (Q13220204): 1881
- lake (Q23397): 1857
- educational institution (Q2385804): 1801
- architectural structure (Q811979): 1755
- locality (Q3257686): 1747
- literary work (Q7725634): 1739
- castle (Q23413): 1703
- urban municipality in Germany (Q42744322): 1602
- neighborhood (Q123705): 1511
- magazine (Q41298): 1509
- mountain system (Q46831): 1488
- foundation (Q157031): 1478
- monastery (Q44613): 1429
- musical ensemble (Q2088357): 1412
- book publisher (Q1320047): 1360
- laboratory (Q483242): 1287
- municipality of Brazil (Q3184121): 1262
- village in the United States (Q751708): 1262
- sports club (Q847017): 1231
- written work (Q47461344): 1228
- private not-for-profit educational institution (Q23002054): 1218
- academic institution (Q4671277): 1214
- record label (Q18127): 1211
- municipality of the Czech Republic (Q5153359): 1207
- château (Q751876): 1184
- faculty (Q180958): 1183
- public educational institution of the United States (Q23002039): 1173
- park (Q22698): 1168
- rock band (Q5741069): 1160
- parish (Q102496): 1153
- pressure group (Q1666019): 1122
- commune community (Q423785): 1102
- hill (Q54050): 1086
- rural municipality of Austria (Q1802801): 1085
- opera (Q1344): 1083
- abbey (Q160742): 1075
- palace (Q16560): 1070
- military unit (Q176799): 1050
- orchestra (Q42998): 1049
- market municipality (Q562061): 1048
- canton of France (Q184188): 1040
- radio station (Q14350): 1036
- valley (Q39816): 1035
- municipality of Slovakia (Q6784672): 1032
- noble family (Q13417114): 1003
- film production company (Q1762059): 994
- labor union (Q178790): 973
- association (Q15911314): 966
- choir (Q131186): 961
- scientific journal (Q5633421): 953
- real property (Q684740): 950
- monument (Q4989906): 941
- secondary school (Q159334): 938
- school district (Q398141): 933
- archive (Q166118): 932
- bay (Q39594): 922
- college (Q189004): 914
- railway station (Q55488): 910
- book (Q571): 898
- learned society (Q955824): 880
- higher education institution (Q38723): 875
- street (Q79007): 855
- musical duo (Q9212979): 848
- municipality of Catalonia (Q33146843): 844
- cultural property (Q2065736): 840
- public library (Q28564): 836
- airport (Q1248784): 835
- newspaper (Q11032): 813
- company (Q783794): 803
- frazione (Q1134686): 794
- public university (Q875538): 769
- institute (Q1664720): 762
- medical organization (Q4287745): 761
- transport company (Q740752): 749
- reservoir (Q131681): 737
- forest (Q4421): 731
- town of Japan (Q1059478): 727
- city or town (Q7930989): 715
- cadastral populated place in the Netherlands (Q1852859): 712
- delegated commune (Q21869758): 709
- city of Japan (Q494721): 709
- memory institution (Q1497649): 695
- scientific society (Q748019): 690
- municipality of Japan (Q1054813): 685
- borough of Pennsylvania (Q777120): 683
- municipality of Hungary (Q2590631): 680
- stream (Q47521): 679
- composed musical work (Q207628): 677
- Counties of China (Q1289426): 671
- cemetery (Q39614): 660
- border city (Q902814): 654
- video game developer (Q210167): 643
- railway line (Q728937): 637
- municipality of the Netherlands (Q2039348): 631
- house (Q3947): 630
- former municipality of Switzerland (Q685309): 609
- historical country (Q3024240): 597
- bank (Q22687): 593
- municipality (Q27676428): 588
- Buddhist temple (Q5393308): 586
- theatre company (Q742421): 586
- art gallery (Q1007870): 571
- sports governing body (Q2485448): 570
- botanical garden (Q167346): 567
- watercourse (Q355304): 565
- cadastral area in the Czech Republic (Q20871353): 562
- play (Q25379): 561
- architectural firm (Q4387609): 554
- concentration camp (Q152081): 546
- geographic region (Q82794): 545
- movie theater (Q41253): 538
- district of Germany (Q106658): 533
- municipality of Belgium (Q493522): 530
- diocese of the Catholic Church (Q3146899): 516
- parish church (Q317557): 511
- municipality (Q15284): 510
- non-governmental organization (Q79913): 506
- township of Pennsylvania (Q9035798): 505
- national park (Q46169): 501
- primary school (Q9842): 499
- hotel (Q27686): 496
- fourth-level administrative division in Indonesia (Q2225692): 495
- district of India (Q1149652): 493
- municipality of Poland (Q15334): 489
- Gymnasium in Germany (Q1542966): 486
- quarter (Q2983893): 485
- square (Q174782): 482
- road (Q34442): 481
- municipal district (Q2198484): 478
- Shinto shrine (Q845945): 475
- national museum (Q17431399): 473
- archaeological museum (Q3329412): 472
- tourist attraction (Q570116): 470
- chapel (Q108325): 469
- protected area (Q473972): 464
- canal (Q12284): 451
- French UMR (Q3550864): 448
- municipality with town privileges in the Czech Republic (Q15978299): 446
- cultural heritage (Q210272): 444
- airline (Q46970): 439
- private university (Q902104): 436
- cathedral (Q2977): 436
- mine (Q820477): 435
- urban area in Sweden (Q12813115): 434
- freguesia of Portugal (Q1131296): 433
- National Wildlife Refuge (Q1410668): 432
- Cooperative Science and Research Body (Q11507944): 431
- television station (Q1616075): 429
- public company (Q891723): 428
- cultural center (Q1329623): 426
- former hospital (Q64578911): 423
- family (Q8436): 420
- international organization (Q484652): 417
- peninsula (Q34763): 413
- cultural institution (Q5193377): 412
- municipality of Mexico (Q1952852): 411
- local museum (Q1595639): 406
- municipality section (Q2785216): 406
- architectural ensemble (Q1497375): 400
- cultural heritage site in Russia (Q8346700): 397
- business school (Q1143635): 386
- fictional human (Q15632617): 378
- political organization (Q7210356): 376
- Scottish civil parish (Q5124673): 375
- historic district (Q15243209): 374
- district of Turkey (Q1147395): 373
- charitable organization (Q708676): 372
- academic library (Q856234): 371
- Act of Congress in the United States (Q476068): 370
- Naturschutzgebiet (Q759421): 370
- duo (Q10648343): 368
- community college (Q1336920): 368
- opera house (Q153562): 368
- county seat (Q62049): 367
- prison (Q40357): 366
- minor basilica (Q120560): 362
- regulatory college (Q1110684): 361
- Amt (Q478847): 361
- township of Ohio (Q17198620): 355
- gymnasium (Q55043): 355
- geographic township of Quebec (Q23019040): 354
- advocacy group (Q431603): 354
- villa (Q3950): 354
- art academy (Q383092): 353
- local heritage association in Sweden (Q61786815): 353
- synagogue (Q34627): 353
- school building (Q1244442): 352
- military museum (Q2772772): 352
- capital city (Q5119): 352
- junior college (Q370258): 351
- suburb (Q188509): 347
- think tank (Q155271): 344
- natural watercourse (Q55659167): 342
- big city (Q1549591): 340
- canton of France (Q18524218): 339
- municipality of Croatia (Q57058): 337
- cadastral municipality of Austria (Q17376095): 336
- facility (Q13226383): 333
- place with town rights and privileges (Q13539802): 333
- municipality with authorized municipal office (Q7841907): 331
- district of Japan (Q1122846): 329
- zoo (Q43501): 327
- US Wilderness Area (Q27995042): 324
- ministry (Q192350): 323
- locality of Mexico (Q20202352): 323
- geographical feature (Q618123): 323
- New England town (Q2154459): 322
- Catholic religious institute (Q5135744): 322
- medical school (Q494230): 320
- seminary (Q233324): 317
- railway company (Q249556): 317
- palazzo (Q2651004): 315
- research center (Q7315155): 312
- Christian denomination (Q879146): 311
- seat of the local council (Q34841063): 303
- association under the French law of 1901 (Q11513034): 302
- cooperative bank (Q3277997): 302
- string quartet (Q207338): 300
- symphony (Q9734): 300
- historical society (Q5774403): 295
- municipality of Portugal (Q13217644): 294
- island group (Q1402592): 291
- international airport (Q644371): 290
- university of applied sciences (Q1365560): 288
- law school (Q1321960): 286
- boarding school (Q269770): 285
- umbrella organization (Q1156831): 284
- Rathaus (Q543654): 284
- private educational institution (Q23002042): 283
- hamlet (Q5084): 283
- dam (Q12323): 277
- cave (Q35509): 275
- middle school (Q149566): 273
- municipality of Galicia (Q2276925): 273
- town in Hungary (Q13218690): 272
- artist collective (Q1400264): 270
- administrative territorial entity of Russia (Q192287): 268
- concert hall (Q1060829): 267
- castle ruin (Q17715832): 267
- former municipality (Q19730508): 267
- urban area in Norway (Q15092344): 265
- metro station (Q928830): 265
- television channel (Q2001305): 264
- port (Q44782): 264
- single entity of population (Q3055118): 260
- convent (Q1128397): 259
- destroyed building or structure (Q19860854): 259
- village in India (Q56436498): 259
- district of Prussia (Q5283531): 258
- Sparkasse (Q13601825): 257
- town of Portugal (Q19833170): 257
- county of Texas (Q11774097): 255
- notname (Q1747829): 254
- periodical (Q1002697): 253
- television series (Q5398426): 253
- bight (Q17018380): 252
- municipality of Colombia (Q2555896): 252
- abandoned village (Q350895): 252
- daily newspaper (Q1110794): 249
- literary character (Q3658341): 249
- national library (Q22806): 249
- district (China) (Q1065118): 243
- municipality of the Philippines (Q24764): 243
- commune of Romania (Q659103): 242
- eingetragener Verein (Q9299236): 242
- Bach cantata (Q1369421): 240
- city or town (Q27676416): 239
- historic house museum (Q2087181): 237
- brewery (Q131734): 236
- village of Ukraine (Q21672098): 234
- sovereign state (Q3624078): 234
- public school (Q1080794): 233
- astronomical observatory (Q1254933): 232
- ghost town (Q74047): 230
- mountain pass (Q133056): 229
- municipal archive (Q604177): 228
- cape (Q185113): 227
- conservatory (Q184644): 224
- academy of sciences (Q414147): 224
- ayuntamiento (Q22996476): 223
- urban municipality (Q2616791): 223
- automobile manufacturer (Q786820): 223
- regency of Indonesia (Q3191695): 222
- Gemarkung (Q1499928): 221
- history museum (Q16735822): 221
- archdiocese (Q2072238): 219
- city of Oregon (Q63440326): 218
- credit institution (Q730038): 218
- administrative territorial entity (Q56061): 218
- fort (Q1785071): 215
- government organization (Q2659904): 215
- constituent locality (Q15921247): 214
- television production company (Q10689397): 213
- ruins (Q109607): 213
- trade association (Q2178147): 213
- engineering college (Q1663017): 212
- Catholic cathedral (Q56242215): 210
- Hanseatic city (Q707813): 208
- district of Indonesia (Q3700011): 205
- restaurant (Q11707): 205
- art group (Q4502119): 204
- academy (Q162633): 203
- municipal part of the Czech Republic (Q61089180): 203
- professional association (Q829080): 203
- museum building (Q24699794): 202
- village of Japan (Q4174776): 202
- university press (Q479716): 201
- central bank (Q66344): 201
- student society (Q1685451): 200
- underground station (Q22808403): 199
- ancient city (Q15661340): 198
- country (Q6256): 198
- lagoon (Q187223): 197
- independent school (Q2418495): 197
- community (Q2630741): 197
- archipelago (Q33837): 197
- embassy (Q3917681): 196
- borough of New Jersey (Q2911266): 192
- film studio (Q375336): 192
- musical work (Q2188189): 191
- mythological Greek character (Q22988604): 191
- municipality of Finland (Q856076): 191
- science museum (Q588140): 190
- fixed construction (Q811430): 190
- song (Q7366): 190
- academic department (Q2467461): 189
- road bridge (Q537127): 189
- environmental organization (Q1785733): 188
- prefecture-level city (Q748149): 188
- ice hockey team (Q4498974): 187
- symphony orchestra (Q239582): 186
- municipality of Georgia (Q76514543): 186
- village of Bulgaria (Q15630849): 185
- academic journal (Q737498): 185
- glacier (Q35666): 185
- air base (Q695850): 183
- collective pseudonym (Q16017119): 182
- former local government area of Australia (Q30129411): 181
- area of London (Q2755753): 180
- regiment (Q52371): 180
- neighborhood of Brazil (Q19658107): 178
- plain (Q160091): 177
- private limited company (Q18624259): 177
- arts center (Q2190251): 177
- law enforcement agency (Q732717): 176
- natural history museum (Q1970365): 175
- powiat (Q247073): 174
- library network (Q28324850): 174
- performing arts center (Q3469910): 174
- city of Switzerland (Q54935504): 174
- trade union federation (Q1391517): 172
- factory (Q83405): 172
- human whose existence is disputed (Q21070568): 171
- former administrative territorial entity (Q19953632): 170
- desa (Q26211545): 170
- printing company (Q6500733): 170
- municipality of Norway (Q755707): 170
- municipality of Sweden (Q127448): 168
- basketball team (Q13393265): 168
- girl group (Q641066): 168
- strait (Q37901): 168
- skyscraper (Q11303): 167
- historic site (Q1081138): 166
- institute of the Russian Academy of Sciences (Q4201890): 166
- Verbandsgemeinde (Q23006): 165
- mosque (Q32815): 165
- university library (Q1622062): 164
- statistical service (Q480242): 164
- pseudonym (Q61002): 164
- agglomeration community (Q159321): 163
- raion of Ukraine (Q1267632): 162
- sanctuary (Q29553): 162
- county of Georgia (Q13410428): 161
- financial institution (Q650241): 161
- abolished municipality in Italy (Q3685476): 160
- small burgh (Q7543008): 160
- shopping center (Q11315): 160
- department of France (Q6465): 160
- academy school (Q4671329): 159
- Belgian municipality with the title of city (Q15273785): 158
- village of Wisconsin (Q15221242): 157
- kibbutz (Q161387): 157
- national archives (Q2122214): 157
- vocational school (Q322563): 157
- state-owned enterprise (Q270791): 156
- comics character (Q1114461): 155
- wadi (Q187971): 155
- beach (Q40080): 154
- convention center (Q1378975): 153
- parish village (Q1493533): 153
- commune of Chile (Q1840161): 153
- municipality of Guatemala (Q1872284): 153
- manuscript (Q87167): 153
- unicameral legislature (Q37002670): 152
- Ortsbezirk of Germany (Q163301): 151
- United States National Forest (Q612741): 151
- bus company (Q10438042): 150
- university building (Q19844914): 150
- rural commune of Vietnam (Q2389082): 150
- painting (Q3305213): 149
- public research university (Q62078547): 149
- city of Portugal (Q15647906): 147
- shipyard (Q190928): 146
- military division (Q169534): 144
- military academy (Q917182): 144
- novel (Q8261): 142
- religious organization (Q1530022): 141
- township of Michigan (Q17205774): 140
- microregion (Q11781066): 139
- Catholic seminary (Q14911880): 139
- mahalle (Q17051044): 139
- arrondissement of France (Q194203): 139
- parish municipality (Q27676524): 139
- city council (Q3154693): 138
- aid agency (Q336473): 138
- music festival (Q868557): 138
- Studentenverbindung (Q1779527): 137
- human biblical figure (Q20643955): 137
- parish of Asturias (Q55102916): 137
- open-air museum (Q756102): 137
- Japanese high school (Q56351315): 136
- manor house (Q879050): 136
- independent city of Germany (Q22865): 136
- German Student Corps (Q14515311): 135
- historical region (Q1620908): 135
- sibling duo (Q14073567): 134
- point (Q24529780): 134
- university hospital (Q1059324): 133
- spring (Q124714): 133
- maritime museum (Q1863818): 133
- provincial park of Canada (Q2006279): 133
- summit (Q207326): 133
- public educational institution (Q23002037): 133
- institution (Q178706): 132
- cultural heritage site in Slovenia (Q18564289): 132
- rural municipality (Q3504085): 132
- subsidiary (Q658255): 132
- baseball team (Q13027888): 131
- district (Q149621): 131
- trademark (Q167270): 131
- lighthouse (Q39715): 131
- urban park (Q22746): 130
- golf course (Q1048525): 129
- institute of technology (Q1371037): 129
- folk high school (Q170087): 129
- Samtgemeinde (Q447523): 129
- tower (Q12518): 129
- cultural heritage ensemble (Q1516079): 128
- Narrenzunft (Q1965390): 128
- intergovernmental organization (Q245065): 128
- city of district significance (Q12131640): 127
- general education liceum (Q3280746): 127
- provincial electoral district of Quebec (Q2973931): 126
VIAF Source Codes
To do coreferencing work, we need to correlate VIAF source codes (e.g. JPG) to WD props (e.g. P245). We can find some of them with this query, and I found the others at the list of sources on http://viaf.org/.
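# VIAF source codes for Wikidata properties marked as VIAF components
select ?entity ?entityLabel ?prop ?propLabel ?code ?countryLabel {
  ?prop wdt:P1552 wd:Q26921380.    # property is marked as a VIAF component
  optional {
    ?prop wdt:P1629 ?entity        # the authority file the property refers to
    optional {?entity wdt:P17 ?country}
  }
  optional {
    ?prop wdt:P3303 ?formatter     # third-party formatter URL
    # keep only formatters of the form https://viaf.org/processed/CODE|$1
    filter(regex(?formatter,"https://viaf.org/processed/(.*)\\|\\$1"))
    bind(replace(?formatter,"https://viaf.org/processed/(.*)\\|\\$1","$1") as ?code)
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} order by ?propLabel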
The sources are listed in the next section.
--Vladimir Alexiev (talk) 16:15, 22 August 2018 (UTC)
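The code extraction in the query above can also be done offline, for instance on an exported list of formatter URLs. Here is a minimal sketch in Python, assuming the formatters have the shape `https://viaf.org/processed/CODE|$1` that the query's regex expects. The function name is made up for illustration.

```python
import re
from typing import Optional

# Same pattern as in the SPARQL query: the VIAF source code sits between
# "processed/" and the literal "|$1" placeholder.
FORMATTER_RE = re.compile(r"https://viaf\.org/processed/(.*)\|\$1")


def viaf_code_from_formatter(formatter: str) -> Optional[str]:
    """Return the VIAF source code embedded in a formatter URL, or None."""
    match = FORMATTER_RE.search(formatter)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(viaf_code_from_formatter("https://viaf.org/processed/JPG|$1"))
```

Formatters that do not point at viaf.org/processed simply yield `None`, which mirrors the `optional` block in the query leaving `?code` unbound.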
VIAF Links per Source
Counted with viaf-links-count.pl, see this gist. I also tried counting WD external-ids with SPARQL (.rq) but those queries timed out. HELP NEEDED.
- 2016-12: 86,698,786 links
- 2018-07: 54,741,718 links. This decrease is strange: the 2018-07 VIAF dump is a fluke, so don't use it.
- LC dropped from 16M to 10M, and DNB from 26M to 9M.
- VIAF dropped Wikipedia links (not fatal since WD has all possible WP links)
- VIAF lists Wikidata on their website, but such links are not in the dump (checked RDF data)
- VIAF dropped these sources: GeoNames, IMAGINE, ORCID
- 2018-08: 85,598,365. Fixed these problems, and added some sources: BLBNB, DE663, ERRR
- 2019-11: see User_talk:Vladimir_Alexiev#VIAF_-_update, hopefully those numbers will be integrated below. You will notice some new types of ID: I linked them to existing IDs or created property proposals:
- ARBABN (Argentina): BNMM authority ID (P3788)
- BLBNB: National Library of Brazil ID (P4619)
- CAOONL (Library and Archives Canada): Canadiana Authorities ID (former scheme) (P1670) (obsolete?)
- PLWABN (Poland): NLP ID (old) (P1695)
- SIMACOB (NUK/COBISS.SI, Slovenia): CONOR.SI ID (P1280)
- Identities: Wikidata:Property_proposal/WorldCat_Identities
- IMAGINE: Israel Museum Jerusalem artist ID (P7681)
- SKMASNL: Slovak National Library (VIAF) ID (P7700)
- LIH: National Library of Lithuania ID (P7699)
2016-12 | 2018-07 | 2018-08 | code | entity | property | country |
---|---|---|---|---|---|---|
359497 | 366280 | 369712 | B2Q | Bibliothèque et Archives nationales du Québec (Q39628) | P3280 | Canada>Quebec |
369985 | 369985 | 369987 | BAV | Vatican Library (Q213678) | P1017 | Vatican |
286192 | 300088 | BLBNB | National Library of Brazil (Q948882) | P4619 | Brazil | |
104455 | 1795284 | 1797032 | BIBSYS | BIBSYS (Q4584301) | P1015 | Norway |
365936 | 185314 | 370676 | BNC | Name and Title Authority File of Catalonia (Q8342938) | P1273 | Spain>Catalonia |
412032 | 205976 | 412034 | BNCHL | Collective Catalog of Bibliographic Authorities of Chile (Q19896851) | P7369 | Chile (not P1890) |
620739 | 642770 | 646521 | BNE | Biblioteca Nacional de España (Q750403) | P950 | Spain |
4562204 | 2395213 | 4806862 | BNF | Bibliothèque nationale de France (Q193563) | P268 | France |
29007 | 29007 | 29009 | BNL | National Library of Luxembourg (Q856651) | P7028 | Luxembourg |
54935 | 79128 | 79991 | CYT | National Central Library (Q618340) | P1048 | Taiwan |
111686 | 111686 | DE663 | Répertoire International des Sources Musicales (Q2178828) | P5504 | intl (RISM) | |
282952 | 162634 | 327462 | DBC | Danish Bibliographic Centre (Q12307383) | P3846 | Denmark |
26490921 | 8866855 | 17785894 | DNB | Integrated Authority File (Q36578) | P227 | DE, AT, CH |
46201 | 52350 | 52350 | EGAXA | Bibliotheca Alexandrina (Q501851) | P1309 | Egypt |
57791 | 57794 | ERRR | National Library of Estonia (Q609471) | P6394 | Estonia | |
178835 | 179726 | 180164 | FAST | Faceted Application of Subject Terminology (Q3294867) | P2163 | intl (OCLC) |
152003 | | 165845 | GeoNames | GeoNames (Q830106) | P1566 | intl (places) |
343166 | 173713 | 347426 | ICCU | National Library Service of Italy (Q576951) | P396 | Italy |
9941 | | 9964 | IMAGINE | | !!? | |
7741093 | 8187394 | 8187405 | ISNI | International Standard Name Identifier (Q423048) | P213 | intl (authors) |
241631 | 264833 | 264834 | JPG | Union List of Artist Names (Q2494649) | P245 | US Getty |
240857 | 331791 | 331797 | KRNLK | National Library of Korea (Q495005) | P5034 | Korea |
718263 | 726993 | 730007 | LAC | Library and Archives Canada (Q913250) | P1670 | Canada |
16256543 | 10156417 | 16912675 | LC | Library of Congress Authorities (Q13219454) | P244 | US, UK, MX, ZA, NZ |
186566 | 205260 | 205260 | LNB | National Library of Latvia (Q1133733) | P1368 | Latvia |
14768 | 15876 | 15879 | LNL | Lebanese National Library (Q2901488) | P7026 | Lebanon |
6299 | 8567 | 8569 | MRBNR | National Library of the Kingdom of Morocco (Q2901478) | P7058 | Morocco |
227699 | 227651 | 227707 | N6I | National Library of Ireland (Q1672830) | P1946 | Ireland |
1079274 | 1114415 | 1116232 | NDL | Web NDL Authorities (Q2553334) | P349 | Japan |
1617979 | 1712963 | 1712980 | NII | CiNii (Q10726338) | P271 | Japan |
817783 | 872953 | 879014 | NKC | Czech National Authority Database (Q13550863) | P691 | Czech Rep |
1036908 | 1111676 | 1114335 | NLA | National Library of Australia (Q623578) | P409 | Australia |
608 | 3940 | 3940 | NLB | National Library Singapore (Q890364) | P3988 | Singapore |
850304 | 1788698 | 1810067 | NLI | National Library of Israel (Q188915) | P949 | Israel |
1219045 | 1482467 | 1494282 | NLP | National Library of Poland (Q856423) | P1695 | Poland |
245488 | 147268 | 294546 | NLR | Russian State Library (Q1048694) | P7029 | Russia (not P947) |
508256 | 537830 | 539669 | NSK | National and University Library in Zagreb (Q631375) | P1375 | Croatia>Zagreb |
67454 | 33727 | 67454 | NSZL | National Széchényi Library (Q1063819) | P951 | Hungary |
2650671 | 2700697 | 2703554 | NTA | KB National Library of the Netherlands (Q1526131) | P1006 | Netherlands |
3090182 | 1731205 | 3482894 | NUKAT | NUKAT (Q11789729) | P1207 | Poland>WarsawU |
7389 | | 17359 | ORCID | ORCID iD (Q51044) | P496 | intl (researchers) |
1228 | 1228 | 1228 | PERSEUS | Perseus Digital Library (Q639661) | P7041 | intl (ancient places) |
404594 | 423917 | 423920 | PTBNP | National Library of Portugal (Q245966) | P1005 | Portugal |
237240 | 253180 | 253914 | RERO | Library Network of Western Switzerland (Q3456783) | P3065 | Switzerland |
205630 | 215367 | 215798 | SELIBR | LIBRIS (Q1798125) | P906 | Sweden |
209 | 209 | 209 | SRP | syriaca.org (Q64866015) | P6934 | Syriac Reference Portal |
2927879 | 3281884 | 3300486 | SUDOC | IdRef (Q47757534) | P269 | France academic |
75989 | 124465 | 124466 | SWNL | Integrated Authority File (Q36578) | P227 | Switzerland (was SZ; part of GND) |
85956 | 85958 | UIY | National and University Library of Iceland (Q627423) | P7039 | Iceland (NULI) | |
8768 | 10452 | 10452 | VLACC | OpenVlacc (Q24247813) | P7024 | Flemish Public Libraries |
125635 | 126826 | W2Z | National Library of Norway (Q924551) | P1015 | Norway | |
1286035 | | 1570365 | WKP | Wikidata entities | | |
6272821 | | 7105991 | Wikipedia | Wikipedia links | | |
528 | 1102 | 1140 | XA | Extended Authorities | n/a | OCLC manual |
2069996 | 2026405 | 2036656 | XR | Extended Relationships | n/a | OCLC manual |
--Vladimir Alexiev (talk) 10:13, 24 August 2018 (UTC)
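For anyone who wants to reproduce per-source counts like the ones in the table, here is a minimal sketch in Python. It is not the viaf-links-count.pl script, just an illustration. It assumes each line of the links dump is a VIAF URI, a tab, and then a link whose source code is the prefix before the first `|` or `@`; please check this against the actual dump before relying on it.

```python
import gzip
import re
from collections import Counter
from typing import Iterable


def source_code(link: str) -> str:
    """Prefix of the link value up to the first '|' or '@' (assumed layout)."""
    return re.split(r"[|@]", link, maxsplit=1)[0]


def count_links_per_source(lines: Iterable[str]) -> Counter:
    """Count links per VIAF source code over tab-separated dump lines."""
    counts: Counter = Counter()
    for line in lines:
        parts = line.rstrip("\n").split("\t", 1)
        if len(parts) != 2:
            continue  # skip lines that do not have the two expected fields
        counts[source_code(parts[1])] += 1
    return counts


def count_dump(path: str) -> Counter:
    """Stream a gzipped links dump without unpacking it to disk."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return count_links_per_source(handle)
```

If a dump lists the same identifier in both the `|` and the `@` form, this sketch counts it twice, so one of the two forms would have to be filtered out first.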
VIAF Help Needed
Notified participants of WikiProject Authority control
Notified participants of WikiProject Names
Notified participants of WikiProject Visual arts
WikiProject Cultural heritage has more than 50 participants and couldn't be pinged. Please post on the WikiProject's talk page instead.
- If you sort by column "property" and look for "!!!", you will find the VIAF sources that don't have corresponding WD properties (which should be marked as VIAF component (Q26921380)). Wikidata:Property_proposal/Authority_control#missing_VIAF_components suggests creating them.
- Please help me verify props in the table above marked "?" or cases where multiple props are listed (although VIAF has both persons and works, maybe from a certain source it takes only persons)
- Do you think we should expand the Qnnn and Pnnn above using the respective macros, so they show up with their names?
- Do you think we should record col "code" in WD in some way? A few of these are available as third-party formatter URL (P3303), but most aren't. It took me a few hours to correlate all VIAF-component sources against WD, and it wasn't fun. Guess we can add it as qualifier? Eg
- And most importantly, help me with "id proliferation" as described below. Please comment on that section!
--Vladimir Alexiev (talk) 08:25, 23 August 2018 (UTC)
Resolve Against VIAF Links
Wikidata authority IDs are obtained from various places, and hand-edited. But http://viaf.org/viaf/data/viaf-20150115-links.txt.gz has links to IDs of VIAF participants (constituent national libraries) and Wikipedia.
Stats:
- 2Gb unzipped size
- 27684634 subjects (27.7M)
- 1.67 links per subject
- 46248396 links, of which 377650 enwiki links.
- Link breakdown:
count | code | source |
---|---|---|
27684634 | VIAF | total subjects |
10531522 | DNB | Germany |
9154093 | LC | LC (NACO) |
7655649 | ISNI | ISNI |
2555033 | NTA | Netherlands |
2508374 | SUDOC | France (Sudoc) |
2036493 | BNF | France (BnF) |
2018647 | XR | xR OCLC file |
1351105 | NUKAT | Poland (NUKAT) |
1032862 | NDL | Japan (NDL) |
1016708 | NLA | Australia |
844024 | NLP | Poland (Nat lib) |
743215 | NKC | Czech |
689827 | LAC | Canada |
570840 | NLI | Israel |
562244 | BNE | Spain |
473518 | NSK | Croatia |
377650 | WKP | Wikipedia |
373078 | PTBNP | Portugal |
320898 | BAV | Vatican |
232327 | JPG | Getty (ULAN) |
220304 | RERO | Swiss (RERO) |
187073 | SELIBR | Sweden |
169028 | ICCU | Italy |
158515 | LNB | Latvia |
144299 | BNC | Catalunya |
101500 | DBC | Denmark (DBC) |
73421 | BIBSYS | Norway |
45633 | SWNL | Swiss (Nat lib) |
37004 | EGAXA | Egypt |
33727 | NSZL | Hungary |
11000 | LNL | Lebanon |
9953 | IMAGINE | |
5723 | VLACC | Belgium (Flemish) |
1228 | PERSEUS | Perseus |
997 | RSL | Russia |
408 | NLB | Singapore |
267 | XA | xA OCLC file |
209 | SRP | Syriac |
Now let's count some IDs in Wikidata, using WDQ API and working down the list:
- VIAF: 504736
- would be nice to cross-check this against the 377650 enwiki links
- Wikidata has 13M items and VIAF has 27.7M subjects, so I would expect at least 3-4M common subjects. This means that we have co-referencing for only 15% of the possible items! A lot of work remains
- GND: 335883
- 37% of all VIAF items have GND id, while in Wikidata the ratio is 66%. This means that GND co-referencing is more advanced than VIAF co-referencing
- VIAF or GND: 567240
- Everything in GND has a VIAF ID, but this shows that in Wikidata, 62504 with GND id don't have a VIAF id. We can assign VIAF id to these easily!
- LCNAF: 210845
- VIAF or LCNAF: 506136
- Again, we can leverage LCNAF ids to assign 1400 VIAF ids easily
- RKDartists: 21760: most of these can be added to VIAF for free!
- NTA: 335883
- SUDOC: 103120
More importantly, we could cross-check all 36 IDs in VIAF against Wikidata to:
- add the missing ones,
- flag different ones with a qualifier "questionable".
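The cross-check proposed above could look roughly like this. It is a minimal sketch in Python under stated assumptions: the identifiers VIAF holds for one cluster are already keyed by the matching Wikidata property, and the current values on the Wikidata item are given as sets (an item may carry several values per property). The function name and the sample values in the usage note are made up.

```python
from typing import Dict, Set, Tuple


def cross_check(
    viaf_links: Dict[str, str],
    wd_values: Dict[str, Set[str]],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split VIAF-side identifiers into 'missing' and 'questionable'.

    missing:      the Wikidata item has no value for that property yet
    questionable: the item has values, but none equals the VIAF-side value
    """
    missing: Dict[str, str] = {}
    questionable: Dict[str, str] = {}
    for prop, value in viaf_links.items():
        known = wd_values.get(prop, set())
        if not known:
            missing[prop] = value
        elif value not in known:
            questionable[prop] = value
    return missing, questionable
```

For example, with `{"P227": "111", "P244": "n1", "P268": "222"}` on the VIAF side and `{"P227": {"111"}, "P244": {"n2"}}` on Wikidata, P268 comes back as missing and P244 as questionable. Whether a questionable value means an error in VIAF or in Wikidata still needs a human look.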
- unsigned comment by user:Vladimir Alexiev 16:33, 23 January 2015
- I imported the NTA stuff some time ago. Could easily do the same for other properties, but I don't know how good the data is and if I need to do some transformations. Please start a wikiproject. Would love to comment over there and get everything imported. Multichill (talk) 21:20, 26 January 2015 (UTC)
- User:Multichill Thank you sir/lady, will do. Chill :-)
- Some remarks:
- VIAF contains "name authorities" (historically persons first, then corporate bodies, and finally "geographic" items, works and "expressions" too). However "Subject Headings" ("topical terms" as in LCSH against LCNAF) are not part of VIAF, although some of the authority files (like GND) include them. Thus "X completely included in VIAF" does not hold for all constituent files of VIAF.
- Please be careful with respect to the ODC-BY license which governs the VIAF dumps: whereas the constituent authority files (GND for sure) are CC-0, VIAF definitely is not. (My interpretation: import on a case-by-case basis is OK; anything more must make certain that the license is met. Thus if fr.wikipedia presented SUDOC numbers fetched from Wikidata, which had been collected here by bulk-matching existing GND numbers against VIAF, French Wikipedia would be obliged to present VIAF attribution or VIAF identifiers next to the SUDOC numbers.)
- OCLC some years ago made a matching between VIAF and en.wikipedia.org and afterwards donated the data so that VIAF numbers could be imported into English Wikipedia (afterwards some cross-check against the VIAF numbers in corresponding articles of de.wikipedia was performed), from where they moved on to Wikidata. I doubt that they ever repeated the matching and believe that "wikipedia" linking in VIAF still reflects the original matching. Therefore in a sense there is no need to compare Wikidata to VIAF again, unless of course this can magically be restricted to VIAF numbers which never appeared in en.wikipedia. -- Gymel (talk) 12:58, 27 January 2015 (UTC)
New VIAF Situation
Have you read this?
Yes, this is a very nice announcement since it means that VIAF will source Wikidata actively, which hopefully will close the gap between the two. I think this makes it even more important to leverage VIAF and other authority IDs in Wikidata. --Vladimir Alexiev (talk) 09:57, 29 May 2015 (UTC)
Update Mar 2017:
- discussion on facebook Wikidata.GLAM group about VIAF coverage of different kinds of authors.
- Out of 1.02M links, 883k are for people:
select (count(*) as ?c) {
?x wdt:P214 ?viaf
filter exists {?x wdt:P31 wd:Q5}
}
- My guesstimate is that out of 4.5M humans on WD, half are in VIAF. So we only have 35% of the possible links. --Vladimir Alexiev (talk) 14:57, 26 March 2017 (UTC)
From Wikipedia to Wikidata Link
The 2019-01-07 dump of VIAF links contains 1,519,171 unique VIAF IDs that have at least one link to a Wikipedia page (information extracted with gunzip -kc viaf-20190107-links.txt.gz | awk -F '\t' '/wikipedia/ {print $1}' | sort | uniq -c | wc -l). However, Wikidata currently has only 1,352,478 items with a VIAF property.
We could therefore add 166,693 VIAF IDs relatively easily, unless they are largely related to Wikipedia pages without an equivalent in Wikidata, which would be interesting. Of course, it is still necessary to figure out which ones are already in Wikidata. What's the best workflow for this kind of verification? Ettorerizza (talk) 01:39, 14 January 2019 (UTC)
- Hi @Ettorerizza, Epìdosis, Bargioni: On 22 Nov 2019 Bargioni imported 570k VIAF ids from viaf-20191104-links. Thus the number of P214 statements on Wikidata grew over 30% to 2,035,798 (see Property talk:P214#Recent synchronisation). Ettorerizza, I think your proposal should be refined as follows: 1. Find VIAF clusters with a Wikipedia link but without Wikidata link. 2. Use the Wikidata site links to find the respective Wikidata item and add a VIAF id. If you do 1, I'll help you to do 2, perhaps with a series of SPARQL queries --Vladimir Alexiev (talk) 09:16, 27 November 2019 (UTC)
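Step 1 of the refined proposal (VIAF clusters with a Wikipedia link but without a Wikidata link) could be sketched like this in Python. The line layout is an assumption: VIAF URI, tab, then either `WKP|Q...` for a Wikidata link or `Wikipedia@<url>` for a Wikipedia link. Please verify the prefixes against the dump you use.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def clusters_without_wikidata(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each VIAF URI that has Wikipedia links but no WKP link to its Wikipedia URLs."""
    wikipedia: Dict[str, List[str]] = defaultdict(list)
    has_wikidata = set()
    for line in lines:
        parts = line.rstrip("\n").split("\t", 1)
        if len(parts) != 2:
            continue  # skip lines without the two expected fields
        viaf_uri, link = parts
        if link.startswith("WKP|"):  # assumed prefix for Wikidata links
            has_wikidata.add(viaf_uri)
        elif link.startswith("Wikipedia@"):  # assumed prefix for Wikipedia links
            wikipedia[viaf_uri].append(link.split("@", 1)[1])
    return {
        viaf_uri: urls
        for viaf_uri, urls in wikipedia.items()
        if viaf_uri not in has_wikidata
    }
```

The returned Wikipedia URLs are what step 2 would resolve to Wikidata items via sitelinks. The sketch keeps all Wikipedia links in memory, which should be fine for a few million lines but is not tuned in any way.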
ID Proliferation with VIAF
See the low WD coreference percentages in #VIAF Volumetrics. I believe the situation is not much different for all VIAF-contributing sources.
What can we do to increase them? There is a simple way: proliferate IDs between all of these authorities.
- Example 1: take http://viaf.org/viaf/156527943/#Museum_het_Rembrandthuis_(Amsterdam,_Netherlands). It has 16 links on VIAF, but only 9 of them are on WD https://www.wikidata.org/wiki/Q277316. We can easily copy the missing links to WD, with this reference: source: VIAF, reference URL: http://viaf.org/viaf/156527943.
- (On the other hand, WD has 7 IDs that are not subject to VIAF: Europeana, Encyclopedia Britannica, Atheneum, Twitter, etc)
- Example 2: take http://www.wikidata.org/entity/Q203266 (Titios Painter). WD knows its ULAN id and since all of ULAN is in VIAF, it's easy to find its VIAF ID: http://viaf.org/viaf/96438055/#Tityos-Maler_etruskischer_Vasenmaler_des_schwarzfigurigen_Stils_(%22Pontische_Vasen%22)
- It's a shame that VIAF knows its WD id but not vice versa!
- WD knows 1.4k ULAN without VIAF. By proliferating the VIAF and VIAF-contributor ids only of these people, we'll probably gain 100k coreferences!
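Copying missing links with a VIAF reference, as in Example 1, could be prepared as QuickStatements input. Below is a minimal sketch in Python that emits tab-separated QuickStatements (version 1) rows with "stated in" (S248) and "reference URL" (S854). The helper name is made up, Q54919 is assumed to be the Wikidata item for VIAF (please verify before running anything), and the identifier value in the usage note is a placeholder, not real data.

```python
from typing import Dict, List

VIAF_ITEM = "Q54919"  # assumed Wikidata item for VIAF, used as "stated in"


def quickstatements_rows(qid: str, viaf_id: str, missing: Dict[str, str]) -> List[str]:
    """One QuickStatements V1 row per missing identifier, referenced to the VIAF cluster."""
    reference_url = f"http://viaf.org/viaf/{viaf_id}"
    rows = []
    for prop, value in sorted(missing.items()):
        rows.append(
            "\t".join(
                [qid, prop, f'"{value}"', "S248", VIAF_ITEM, "S854", f'"{reference_url}"']
            )
        )
    return rows
```

For instance `quickstatements_rows("Q277316", "156527943", {"P244": "n00000000"})` yields one row for the Rembrandthuis item. The rows would still need a manual check before being pasted into QuickStatements, especially given the ODC-BY remarks elsewhere on this page.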
--Vladimir Alexiev (talk) 16:15, 22 August 2018 (UTC)
- @Vladimir Alexiev: I get a "query is malformed" error on your ULAN without VIAF... - PKM (talk) 21:01, 22 August 2018 (UTC)
- @PKM: fixed. And somehow the number now is lower?? --Vladimir Alexiev (talk) 08:03, 23 August 2018 (UTC)
- @Vladimir Alexiev: Thanks. Maybe someone jumped in to fix some? It wasn’t me (yet). - PKM (talk) 19:15, 23 August 2018 (UTC)
VIAF members
#All properties representing VIAF members
SELECT ?id ?idLabel ?cod
WHERE {
?id wdt:P31 wd:Q55586529 ;
p:P1552 [ ps:P1552 wd:Q26921380; pq:P3295 ?cod ] .
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
ORDER BY ?cod
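To use the result of this query in scripts, the JSON output of the query service can be turned into a code-to-property lookup. Here is a minimal sketch in Python using only the standard library. It assumes the standard SPARQL JSON results layout and the public endpoint https://query.wikidata.org/sparql; the function names and the User-Agent string are placeholders.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def run_query(sparql: str) -> dict:
    """Send a query to the Wikidata Query Service and return the parsed JSON."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode({"query": sparql, "format": "json"})
    request = urllib.request.Request(
        url, headers={"User-Agent": "viaf-members-sketch/0.1 (example)"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def code_to_property(results: dict) -> Dict[str, str]:
    """Map each VIAF code (?cod) to the property ID taken from the ?id entity URI."""
    mapping: Dict[str, str] = {}
    for row in results["results"]["bindings"]:
        prop = row["id"]["value"].rsplit("/", 1)[-1]
        mapping[row["cod"]["value"]] = prop
    return mapping
```

If two properties ever share a code, the later row overwrites the earlier one in this sketch, so a list-valued mapping might be the safer choice in practice.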
Swedish National Library moves direction Linked data
On Monday, June 11, the Swedish National Library is moving towards Linked Data and releasing a new system called LIBRIS XL:
- We have a new identifier, Libris-URI (P5587), in Wikidata
An open issue is whether other systems need to be upgraded to support the new identifier. I have found:
- User:Magnus_Manske/authority_control.js
- tool to read from VIAF.org and populate WD
- User:Magnus_Manske/Mix'n'match_date_import
- wdmapper
See also:
- BIBFRAME in Libris XL
- Datamodel chosen by The National Library of Sweden (KB)
- Presentation from 2017 Bibframe in an European perspective
- Salgo60 (talk) 14:07, 10 June 2018 (UTC)
- Status update 16 September 2018
- A project wmse-library-data-2018 done by Alicia_Fagerving_(WMSE) is converting and adding about 60 000 identifiers to Wikidata....
- I sent in today a question to VIAF asking about plans they have to start supporting Libris-URI (P5587) - Salgo60 (talk) 06:54, 16 September 2018 (UTC)
- Status update 28 September 2019
- Still no one has specified how LIBRIS <-> VIAF should work together see Phabricator T223259
- We lack a change process for errors in VIAF and LIBRIS see example of error Talk:Q21522286#Summary
- - Salgo60 (talk) 18:32, 28 September 2019 (UTC)
Introducing mapping relation type (P4390) for full SKOS mappings
For knowledge organization systems which do not deal with clearly defined entities (such as human (Q5)), you often find inexact relations between the external entity and matching Wikidata items. E.g., the STW concept Yugoslavia (until 1990) is not an exact match to Yugoslavia (Q36704), which is described as "1918–1992 country in Southeastern and Central Europe". For STW's Executive selection, assessment centre (Q265558) is closely related, but categorically different (process vs. instrument). While in the latter case it might be useful to create an exactly matching item, in the former it clearly would not.
Within the domain of traditional knowledge organization systems, different mapping relations (such as "close match") have been used to cover such situations. The "external id" properties of Wikidata lacked such expressiveness until the mapping relation type (P4390) property was introduced. The property, to be used exclusively as a qualifier on external-id properties, allows relations to be defined more precisely as exact match (Q39893449), close match (Q39893184), broad match (Q39894595), narrow match (Q39893967) or related match (Q39894604). These relation types reflect the corresponding SKOS mapping properties.
Since its introduction in October last year, the property has seen some uptake, particularly in the biomedical field. The ongoing mapping of STW Thesaurus for Economics (Q26903352) to Wikidata is based on qualified relations (state of the mapping, shown as SKOS). The effort aims at eventually creating a complete mapping of the STW descriptors.
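To make the correspondence with SKOS concrete, here is a minimal sketch in Python that translates a mapping relation type (P4390) qualifier value into the SKOS mapping property of the same name. The Q-ids are the ones listed above. The function names are made up, and the direction (Wikidata item as subject, external concept as object) is an assumption of this sketch.

```python
from typing import Optional, Tuple

# P4390 qualifier value -> SKOS mapping property
SKOS_BY_RELATION = {
    "Q39893449": "skos:exactMatch",
    "Q39893184": "skos:closeMatch",
    "Q39894595": "skos:broadMatch",
    "Q39893967": "skos:narrowMatch",
    "Q39894604": "skos:relatedMatch",
}


def skos_property(relation_qid: Optional[str]) -> Optional[str]:
    """SKOS property for a P4390 value; None if unqualified or unknown."""
    if relation_qid is None:
        return None
    return SKOS_BY_RELATION.get(relation_qid)


def mapping_triple(
    item_uri: str, concept_uri: str, relation_qid: Optional[str]
) -> Optional[Tuple[str, str, str]]:
    """(subject, predicate, object) for a qualified mapping, else None."""
    predicate = skos_property(relation_qid)
    if predicate is None:
        return None
    return (item_uri, predicate, concept_uri)
```

Unqualified statements deliberately produce no triple here, since without the qualifier the kind of match is simply not known.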
Workflow for the mapping process
Our - still experimental - workflow for the STW mapping is as follows:
- Use Mix-n-match catalogs for each sub-thesaurus (#507, #1259/#1260) to assign STW descriptor IDs one-by-one to Wikidata items.
- This sounds simple, yet often reveals quality issues. Some of these, such as obvious duplicates in Wikidata, can be resolved immediately. Sometimes an ugly mess shows up, which can only be reported to be solved by the community later on. Of course, quality flaws in STW might be revealed also. During this step, it may also be advisable to take notes about other items, which are not the closest ones and not linked via external-id entry in Wikidata, but may be worth linking from the side of the external vocabulary.
- For non-exact relationships, immediately open the newly linked item and manually qualify STW Thesaurus for Economics ID (P3911).
- For STW descriptors which are lacking a counterpart in Wikidata and would make sense there, add an item semi-automatically, with exact match (Q39893449) to the STW ID and all available information from the thesaurus (more on that later).
- Assign exact match (Q39893449) to the remaining unqualified STW Thesaurus for Economics ID (P3911) via Quickstatements, with the input produced by a script executing a SPARQL query. This step can be executed multiple times during the mapping process for one sub-thesaurus, in order to keep the list of unqualified entries short.
The sequence within the Mix-n-match input file turned out to be crucial for a smooth one-by-one workflow. We sorted the generated M-n-m input by the minimal notation of attached subject categories for the descriptors, and within that alphabetically.
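The sort order described in the last paragraph can be expressed as a two-part sort key. This is a minimal sketch in Python; the record layout (a dict with `label` and a list of `notations`) and the notation strings in the usage note are hypothetical and only serve the illustration.

```python
from typing import Dict, List, Tuple


def mnm_sort_key(descriptor: Dict) -> Tuple[str, str]:
    """Sort by the smallest attached category notation, then alphabetically by label."""
    # Descriptors without any category sort last (an assumption of this sketch).
    notations: List[str] = descriptor.get("notations") or ["\uffff"]
    return (min(notations), descriptor["label"].casefold())


def sort_for_mix_n_match(descriptors: List[Dict]) -> List[Dict]:
    """Return the descriptors in the order intended for the Mix-n-match input file."""
    return sorted(descriptors, key=mnm_sort_key)
```

With records such as `{"label": "Zinc", "notations": ["B.01"]}`, `{"label": "apple", "notations": ["V.02", "B.01"]}` and `{"label": "Bank", "notations": ["A.03"]}`, the result is Bank, apple, Zinc. Notations are compared as plain strings, which is only adequate if they are written in a uniform pattern.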
Quality control for mappings using mapping relation types
Maintenance and quality control on a mapping have to take into account that multiple external-id values for one Wikidata item, or one external-id linked to multiple Wikidata items, are possible and may make perfect sense with inexact mapping relations (e.g., STW's Appenzell is a "broad match" to Appenzell Ausserrhoden (Q12079) and Appenzell Innerrhoden (Q12094)). This is not reflected in the "single value" and "distinct values" constraints. Therefore, we defined a number of QS reports for the STW mapping to catch anomalies specifically in qualified mappings.
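To illustrate the kind of check meant here (not the actual reports), the following sketch flags an STW ID only if it is linked to several items and at least one of those links is an exact match or has no qualifier at all. Relation types are written as short strings for readability; in Wikidata they are items such as exact match (Q39893449). The STW identifiers and the second pair of QIDs are placeholders.

```python
from collections import defaultdict


def find_anomalies(mappings):
    """Return STW IDs linked to several items where a link is exact or unqualified."""
    by_id = defaultdict(list)
    for stw_id, qid, relation in mappings:
        by_id[stw_id].append((qid, relation))

    anomalies = {}
    for stw_id, links in by_id.items():
        items = {qid for qid, _ in links}
        # Several broad or narrow matches are fine; exact or missing relations are not.
        suspicious = any(rel is None or rel == "exact" for _, rel in links)
        if len(items) > 1 and suspicious:
            anomalies[stw_id] = sorted(items)
    return anomalies


mappings = [
    # The Appenzell case from the text: two broad matches, which is legitimate.
    ("stw-appenzell", "Q12079", "broad"),
    ("stw-appenzell", "Q12094", "broad"),
    # Placeholder case: one ID claimed as an exact match by two items.
    ("stw-placeholder", "Q4115189", "exact"),
    ("stw-placeholder", "Q13406268", "exact"),
]
print(find_anomalies(mappings))
```

The same idea applies in the other direction, for one item carrying several STW IDs.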
Feedback welcome
We are interested in exchanging experiences with others who are mapping KOS to Wikidata, possibly with different workflows or other tools. The reports and scripts linked above are meant to be customizable; we'd be happy to receive suggestions for improvement or GitHub pull requests. -- Jneubert (talk) 15:09, 29 August 2018 (UTC)
- Notified participants of WikiProject KOS
Item creation from a thesaurus concept via Quickstatements
During the above-mentioned mapping process from STW Thesaurus for Economics (Q26903352), we sometimes want to create new items in Wikidata. With the "New item" button in Mix-n-match, only rudimentary information can be transferred to the new item. Therefore we generated a list of all not-yet-mapped STW descriptors, formatted for Quickstatements input. It includes labels and aliases (skos:prefLabel/skos:altLabel) in all available languages, as well as an instance-of economic concept (Q29028649) statement, sourced from the STW descriptor, optionally a link to the corresponding GND concept, derived from the STW/GND mapping, rarely a description (skos:scopeNote), but of course always a STW Thesaurus for Economics ID (P3911) link.
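A rough sketch of what one entry of such a list might look like when generated from a concept. It is not the linked script. It assumes QuickStatements V1 syntax (CREATE/LAST, with L, A and D prefixes for labels, aliases and descriptions), GND ID (P227) for the optional GND link, mapping relation type (P4390) as the qualifier, and a stated in (P248) reference as one possible reading of "sourced from the STW descriptor". The dictionary layout and the identifier are invented.

```python
def new_item_statements(concept):
    """Build QuickStatements V1 lines that create one item from a SKOS-like concept."""
    lines = ["CREATE"]
    for lang, label in concept["prefLabels"].items():
        lines.append(f'LAST\tL{lang}\t"{label}"')
    for lang, aliases in concept.get("altLabels", {}).items():
        for alias in aliases:
            lines.append(f'LAST\tA{lang}\t"{alias}"')
    for lang, note in concept.get("scopeNotes", {}).items():
        lines.append(f'LAST\tD{lang}\t"{note}"')
    # instance of economic concept, with the STW item as reference (assumed form).
    lines.append("LAST\tP31\tQ29028649\tS248\tQ26903352")
    stw_id = concept["stw_id"]
    lines.append(f'LAST\tP3911\t"{stw_id}"\tP4390\tQ39893449')
    gnd_id = concept.get("gnd_id")
    if gnd_id:
        lines.append(f'LAST\tP227\t"{gnd_id}"')
    return "\n".join(lines)


# Invented example concept; the STW ID is a placeholder.
concept = {
    "stw_id": "12345-6",
    "prefLabels": {"en": "offshore industry"},
    "altLabels": {"en": ["oil platform"]},
}
print(new_item_statements(concept))
```

Keeping each alias on its own line makes it easy to delete unsuitable ones by hand before running the statements.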
The workflow, while working through a Mix-n-match list, is simply to copy & paste the complete set of QS statements for any missing concept into the QS input window, remove aliases which are not appropriate for Wikidata (such as "oil platform" for "offshore industry"), and run the statements. The list is sorted exactly like the Mix-n-match list and recreated every hour, so the same items are on top of both lists, and every case solved by either linking or creating an item disappears automatically from the list. In our experience, this works quite smoothly.
If others want to adapt such a workflow for other vocabularies (thanks to SKOS and LOD standards, that shouldn't be too difficult), here is the script for generating the list, and the query called with it. Feedback, as always, welcome. -- Jneubert (talk) 12:11, 5 September 2018 (UTC)
History
Please add references, blog posts, etc. on the topic here.
https://twitter.com/hashtag/coreferencing: tweet using the tag #coreferencing. Tweets involving Getty, British Museum thesauri, some fancy shots...
- 201809: Linking Knowledge Organization Systems via Wikidata (DCMI conference 2018)
- 201711: Wikidata as authority linking hub: Connecting RePEc and GND researcher identifiers by Jneubert
- 201611: Wikidata and Persistent Identifiers Presentation by Arthur Smith at PIDapalooza 2016
- 20150327: Starting in Apr 2015, VIAF will transition from English Wikipedia coreferencing to Wikidata coreferencing. As a result it will pick up a lot more multilingual labels, 700k persons and 300k organizations that don't occur in English Wikipedia. In Name Data Sources for Semantic Enrichment I argued that VIAF and Wikidata have few names in common: I am glad that this development will quickly bridge the gap. http://outgoing.typepad.com/outgoing/2015/03/moving-to-wikidata.html
- 20150325: Our presentation proposal with Europeana accepted for Glam-Wiki 2015
- 20150324: WikiProject Authority control (#wikidata #coreferencing) to be highlighted by User:Multichill at GlamWiki 2015
- 20150227: Wikidata as linked data authority for Europeana: Presentation proposal "Wikidata, a target for Europeana's semantic strategy?" to GLAM-WIKI 2015
- 20150207: ODI Culture Challenge proposal GLAM-WIKI on Steroids: not well written, wasn’t successful
- 201502: project announced: Wikidata:Project_chat/Archive/2015/02#Wikidata weekly summary #143
- 201501: project proposed: Wikidata:Project_chat/Archive/2015/01#WikiProject Authority Control?
- 201307 Authority Addicts: The New Frontier of Authority Control on Wikidata Wikimania 2013
- meta:Grants:IdeaLab/Countering systemic bias through Wikidata authority control (ideas by User:Superm401)