Wikidata:Property proposal/Ringgold identifier

Ringgold identifier

Originally proposed at Wikidata:Property proposal/Organization

   Done: Ringgold ID (P3500) (Talk and documentation)
Description: unique identifier for organisations in the publishing industry supply chain
Represents: Ringgold identifier (Q17016896)
Data type: External identifier
Domain: organisations
Allowed values: \d{4,6}
Example: Wellcome Trust (Q326276) → 5072
Source: ORCID API: https://members.orcid.org/api/tutorial-retrieve-data-using-public-api
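
The allowed-values pattern above can be checked mechanically. A minimal sketch in Python, assuming the pattern is meant to match the whole value and only ASCII digits; the function name `is_valid_ringgold_id` is hypothetical and not part of the proposal:

```python
import re

# Pattern taken from the "Allowed values" row.
# re.ASCII restricts \d to 0-9; fullmatch() requires the whole string to match.
RINGGOLD_PATTERN = re.compile(r"\d{4,6}", re.ASCII)

def is_valid_ringgold_id(value: str) -> bool:
    """Return True if value consists of 4 to 6 digits and nothing else."""
    return RINGGOLD_PATTERN.fullmatch(value) is not None

print(is_valid_ringgold_id("5072"))  # the Wellcome Trust example from the proposal
print(is_valid_ringgold_id("123"))   # only three digits
```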
Motivation

(Add your motivation for this property here.) GZWDer (talk) 18:23, 16 January 2017 (UTC)

Discussion
  •   Oppose - this is proprietary information and I do not believe it either belongs in Wikidata or is legally allowed to be placed here. ArthurPSmith (talk) 19:41, 17 January 2017 (UTC)
  •   Support - ORCID iD (P496) identifiers are assigned to researchers, and the ORCID API makes it possible to retrieve a list of the institutions a researcher is affiliated with. Very often, these institutions come with a Ringgold ID, so it would be useful to include these identifiers in the institution items. This would enable us to create more links between researchers and institutions. I do not believe that using this source of Ringgold IDs could be a legal issue, as the ORCID data dump is released under a license that is compatible with CC0. Pintoch (talk) 18:14, 18 January 2017 (UTC)
  •   Support per Pintoch. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 18:52, 18 January 2017 (UTC)
  • Comment about the legal concerns: let me quote the Mix'n'Match FAQ: "Individual identifiers, such as numbers, can not be under copyright. If you are an institution based in Europe, the whole of your ID list may be under database copyright, but we are not copying the entire list in bulk; rather, volunteers add most of them individually, one at a time." I suppose that not all Ringgold identifiers are present in the ORCID dump, so an import from this source would not import the whole database. − Pintoch (talk) 18:59, 18 January 2017 (UTC)
  • Note: I am posting in my capacity as Wikimedian in Residence at ORCID. I have a statement from ORCID: "Per our agreement with Ringgold, we are allowed to share the Ringgold identifiers and limited metadata (organization name, location) under CC0 license, just as the rest of ORCID data are available. We would not be using Ringgold otherwise. If someone gets a Ringgold ID out of ORCID, they are free to use it." Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 00:12, 19 January 2017 (UTC)
  • Excellent. Here is the kind of data we can extract from the ORCID 2016 dump (formatted for the Mix'n'Match tool):

I'll publish the dataset (15423 IDs) and the code I used if this property is created. − Pintoch (talk) 00:15, 20 January 2017 (UTC)

Hmm, well, I have concerns about verifiability - what if some of the IDs provided by ORCID are wrong? How would we know? I've worked with some datasets that had Ringgold IDs before where a substantial fraction (several percent) of the entered IDs were incorrect - or at least disagreed between two comparable sources. But I suppose it's better than nothing, and I'm glad that they worked out that license agreement. ArthurPSmith (talk) 16:24, 23 January 2017 (UTC)
@ArthurPSmith: yeah, the ORCID dataset is quite noisy too. Unfortunately they have made UI design decisions that allow users to pollute the dataset with fake matches (see this GitHub issue). The excerpt from the dataset above lists matches by decreasing number of occurrences (so for the ones I have quoted, we can be sure these are the right identifiers). We can always confirm an ID by using the ORCID UI, adding an institution to a (fake) profile on sandbox.orcid.org, and checking which Ringgold ID it gets (or by manually calling the AJAX URL that does the autocompletion there). That's very hacky and quite annoying, but I'm not aware of any other open data source to do that. If you still have access to your other datasets, do you think they could be used just to compare?
I wish ORCID exposed ISNI IDs instead of Ringgold IDs (since Ringgold seems to have aligned their own dataset with ISNI), because that would make all this a lot simpler… − Pintoch (talk) 17:39, 23 January 2017 (UTC)
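
The ranking by number of occurrences mentioned above can be sketched roughly as follows. This is not the code used to produce the dataset: the input format (pairs of institution name and Ringgold ID) and the function name `rank_matches` are assumptions, and the ID "9999" is made up to stand for a wrong match picked by one user.

```python
from collections import Counter, defaultdict

def rank_matches(pairs):
    """Group (institution name, Ringgold ID) pairs and rank the IDs by frequency."""
    counts = defaultdict(Counter)
    for name, ringgold_id in pairs:
        counts[name][ringgold_id] += 1
    # For each name, list (ID, occurrences) with the most frequent ID first.
    return {name: counter.most_common() for name, counter in counts.items()}

# Hypothetical affiliation entries; only the 5072 pairing comes from the proposal.
pairs = [
    ("Wellcome Trust", "5072"),
    ("Wellcome Trust", "5072"),
    ("Wellcome Trust", "9999"),
]
ranked = rank_matches(pairs)
print(ranked["Wellcome Trust"][0])  # the most frequent ID and its count
```

An ID that dominates the counts for a name is a plausible match; names whose top candidates are close would still need a manual check.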
The datasets I have access to with Ringgold information are explicitly NOT open or allowed to be used for other purposes of this sort (per the license agreement with Ringgold when initially set up). Also they are quite small - just a few thousand institutions at most. And still plenty of errors. We simply had to drop the conflicting identifiers, as we had no way to verify things once our Ringgold contract expired. But ISNI is hardly better - I've been comparing ISNIs between Orgref and GRID (both open datasets with tens of thousands of ISNI IDs) and there are a lot of disagreements in that too. At least with ISNI you end up with a URL you can dereference to verify, although often there's not much more than a name, which might not actually resolve the issue. ArthurPSmith (talk) 23:37, 23 January 2017 (UTC)
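
A comparison of this kind, keeping identifiers on which two sources agree and setting aside the conflicting ones, might look like the sketch below. The mapping format, the function name `split_agreements`, and all keys and identifier values are placeholders, not real Orgref, GRID, or Ringgold data.

```python
def split_agreements(source_a, source_b):
    """Compare two {organisation key: identifier} mappings.

    Returns (agreed, conflicts). Keys present in only one source
    appear in neither result, since there is nothing to compare.
    """
    agreed, conflicts = {}, {}
    for key in source_a.keys() & source_b.keys():
        if source_a[key] == source_b[key]:
            agreed[key] = source_a[key]
        else:
            conflicts[key] = (source_a[key], source_b[key])
    return agreed, conflicts

# Placeholder data: "org-2" disagrees, "org-3" is missing from the second source.
source_a = {"org-1": "id-1", "org-2": "id-2", "org-3": "id-3"}
source_b = {"org-1": "id-1", "org-2": "id-9"}
agreed, conflicts = split_agreements(source_a, source_b)
print(agreed)
print(conflicts)
```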

Update: I have finally used the disambiguation dialog to circumvent the issue of fake matches introduced by users. It also improves coverage a lot. The dataset can be found at https://doi.org/10.5281/zenodo.268334 . The first half of it is on Mix'n'Match (where ISNIs are used, to leverage the existing statements for that identifier). I have pulled ISNI identifiers from GRID and VIAF. I will soon add Ringgold statements based on the existing ISNIs. − Pintoch (talk) 12:18, 3 February 2017 (UTC)
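
Deriving Ringgold statements from existing ISNIs amounts to a join on the ISNI. A minimal sketch, assuming two lookup tables keyed by ISNI; the function name `ringgold_statements` and the ISNI key `"ISNI-A"` are placeholders (no real ISNI is shown), while Q326276, P3500 and 5072 come from the proposal above:

```python
RINGGOLD_PROPERTY = "P3500"  # Ringgold ID property created from this proposal

def ringgold_statements(isni_to_item, isni_to_ringgold):
    """Return (item, property, Ringgold ID) triples for ISNIs found in both tables."""
    statements = []
    # Sort the shared ISNIs so the output order is deterministic.
    for isni in sorted(isni_to_item.keys() & isni_to_ringgold.keys()):
        statements.append((isni_to_item[isni], RINGGOLD_PROPERTY, isni_to_ringgold[isni]))
    return statements

# Placeholder ISNI keys; "ISNI-B" has no Ringgold match and is skipped.
isni_to_item = {"ISNI-A": "Q326276", "ISNI-B": "Q0"}
isni_to_ringgold = {"ISNI-A": "5072"}
statements = ringgold_statements(isni_to_item, isni_to_ringgold)
print(statements)
```

Since ISNI assignments themselves disagree between sources, as noted in the discussion, triples produced this way would still deserve review before being added.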