Wikidata:Property proposal/GBIF occurrence ID

GBIF occurrence ID

Originally proposed at Wikidata:Property proposal/Sister projects

   Not done
Description: Identifier for occurrences of species in Global Biodiversity Information Facility (Q1531570)
Data type: External identifier
Domain: (formerly) living things
Allowed values: [1-9]\d*
Example 1: <https://commons.wikimedia.org/entity/M114906872> → 2350455454
Example 2: <https://commons.wikimedia.org/entity/M114977180> → 1655936481
Example 3: <https://commons.wikimedia.org/entity/M114725902> → 2831935646
Source: gbif.org
Planned use: Add provenance to images uploaded to Commons and to references in Wikidata, similar to iNaturalist observation ID (P5683)
Expected completeness: always incomplete
Formatter URL: https://www.gbif.org/occurrence/$1
Robot and gadget jobs: OpenRefine, Tarsier, manually.
See also: iNaturalist observation ID (P5683), GBIF taxon ID (P846)
Applicable "stated in"-value: Global Biodiversity Information Facility (Q1531570)
Wikidata project: Wikidata:WikiProject Biodiversity
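
As a rough illustration of how the "Allowed values" pattern and the formatter URL above fit together, here is a minimal Python sketch. The function names are hypothetical and chosen only for this example; the pattern and URL template are copied from the proposal. The sketch assumes the pattern must match the whole value and that only ASCII digits are intended.

```python
import re

# Pattern and URL template copied from the proposal's
# "Allowed values" and "Formatter URL" fields.
# re.ASCII restricts \d to 0-9 (an assumption about the intent).
ALLOWED_VALUES = re.compile(r"[1-9]\d*", re.ASCII)
FORMATTER_URL = "https://www.gbif.org/occurrence/$1"


def is_valid_occurrence_id(value: str) -> bool:
    """Return True if the whole value matches the allowed-values pattern."""
    return ALLOWED_VALUES.fullmatch(value) is not None


def occurrence_url(value: str) -> str:
    """Substitute a validated value into the formatter URL."""
    if not is_valid_occurrence_id(value):
        raise ValueError(f"not a valid GBIF occurrence ID: {value!r}")
    return FORMATTER_URL.replace("$1", value)


# Example 1 from the proposal.
print(occurrence_url("2350455454"))
# prints "https://www.gbif.org/occurrence/2350455454"
```

A value with a leading zero, such as "0123", is rejected because the pattern requires the first digit to be 1-9.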

Motivation

GBIF, the Global Biodiversity Information Facility, is a free and open-access portal to biodiversity data and images. Quite a few of those images carry a Wikimedia-compatible license (CC0, CC-BY), which allows reuse. We would like to have a property that links to the original occurrence record, to maintain provenance in Commons and Wikidata. Andrawaag (talk) 22:47, 18 February 2022 (UTC)

Discussion

  •   Support Useful proposal Spinster 💬 22:50, 18 February 2022 (UTC)
  •   Support I'm supportive, but we should also add an example of how it can be used on Wikidata, to make it easier for more users to understand. Ainali (talk) 08:36, 19 February 2022 (UTC)
  •   Comment the images I uploaded in 2019 from the GBIF dataset "Invertebrate Zoology Division, Yale Peabody Museum" seem to have stable occurrence IDs. E.g. for the following files, the links to the occurrences are provided in the source field of the file pages and have not changed since the time of the uploads:
c:File:Chelonibia testudinaria (YPM IZ 028450).jpeg → 350589333
c:File:Cypria petenensis (YPM IZ 005670.CR) 008.jpeg → 1039235957
Note that another potentially useful application is linking type specimen items to their occurrences in GBIF, e.g.
NHM 2011.2080 (Q54854611) → 1055358369
  Question how do we know that they change "a lot"? Christian Ferrer (talk) 08:31, 20 February 2022 (UTC)
  •   Comment (I'm a software developer at GBIF.) GBIF is an index of over 65,000 datasets published by over 1,700 institutions, and updates from those publishers can vary significantly -- some datasets are essentially static, others have weekly or even daily changes.
When a dataset is newly published, or additional records are added to an existing dataset, we generate new GBIF occurrence IDs for the new occurrence records. If records within a dataset are changed, we aim to keep the same GBIF occurrence ID. We use four fields within the dataset to do this (dwc:occurrenceID, dwc:institutionCode, dwc:collectionCode, dwc:catalogNumber). If these fields are not changed, the occurrence record will keep the same GBIF occurrence ID. If any of these fields is changed, the record may be assigned a new GBIF occurrence ID and the old one deleted. (In some cases we can detect the change and maintain the original IDs.) Additionally, if records are moved from one dataset to another, they will get new IDs.
We're currently working to improve ID stability as we recognize there are more users relying on the stability of these IDs, though this will not solve all cases.
Some publishers aim to keep identifiers within their record (dwc:occurrenceID, dwc:institutionCode, dwc:collectionCode, dwc:catalogNumber) stable, and others do not. Some keep them stable as a human sees it, but not stable as a machine sees it ("A 123" → "A123" → "http://occ.example.org/A/123").
M Blissett (talk) 11:26, 21 February 2022 (UTC)
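
To illustrate the "stable as a human sees it, but not stable as a machine sees it" point in the comment above, here is a small Python sketch. It is not GBIF's actual matching logic: the class, both comparison functions, and the placeholder codes "INST" and "COLL" are hypothetical, and the tolerant comparison is just one assumed way a change might be detected.

```python
from typing import NamedTuple


class IdentityFields(NamedTuple):
    """The four Darwin Core fields named in the comment above."""
    occurrence_id: str
    institution_code: str
    collection_code: str
    catalog_number: str


def exact_match(old: IdentityFields, new: IdentityFields) -> bool:
    """Machine-level comparison: any character difference counts as a change."""
    return old == new


def loose_match(old: IdentityFields, new: IdentityFields) -> bool:
    """Hypothetical tolerant comparison: ignore whitespace and letter case."""
    def squash(value: str) -> str:
        return "".join(value.split()).casefold()
    return tuple(map(squash, old)) == tuple(map(squash, new))


# Placeholder institution and collection codes; only the occurrence ID varies,
# following the "A 123" -> "A123" -> URL sequence from the comment.
v1 = IdentityFields("A 123", "INST", "COLL", "123")
v2 = v1._replace(occurrence_id="A123")
v3 = v1._replace(occurrence_id="http://occ.example.org/A/123")

print(exact_match(v1, v2), loose_match(v1, v2))  # False True
print(exact_match(v2, v3), loose_match(v2, v3))  # False False
```

The first change is recoverable with simple normalisation, while the switch to a URL form is not, which is consistent with the remark that only some changes can be detected.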
@M Blissett Can you give an estimate of the number of GBIF occurrence IDs that change, relative to the stable ones? Obsoleted identifiers are not specific to GBIF; I would argue that this happens in most if not all databases. If the number of ID changes is low, I would still like to move forward with this property proposal. We need this property to provide attribution and provenance for the reused objects. It might even help solve this issue, since Commons will contain a copy of the digital object being referenced. GBIF might be able to use this linked data to create mappings for those cases where the identifier is reissued. Andrawaag (talk) 11:58, 21 February 2022 (UTC)
(I also work at GBIF)
Please also be aware of this issue which will bring the last known view to the tombstone page, even when records are removed.
I will try and resurrect some previous work I did that illustrates the extent of the changes.
Tim (trobertson@gbif.org) Timrobertson100 (talk) 13:12, 22 February 2022 (UTC)
@Andrawaag
To get a rough sense of ID stability: during 2021 GBIF added 285.4M records (to a total of ~1.9B) and issued 416M identifiers. This indicates that 130.6M records lost their IDs or were removed, or ~6.8% of records. Timrobertson100 (talk) 14:04, 22 February 2022 (UTC)
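
The arithmetic behind these figures can be retraced with a short Python sketch. The variable names are made up for this example, and the total is taken as exactly 1.9 billion, which the comment only gives approximately; with that rounded total the share comes out at roughly 6.9%, close to the ~6.8% quoted, which presumably used the unrounded total.

```python
# Figures quoted in the comment above (2021).
added_records = 285_400_000      # records added during the year
issued_ids = 416_000_000         # identifiers issued during the year
total_records = 1_900_000_000    # "~1.9B": rounded, so the share is approximate

# Identifiers issued beyond what new records account for, i.e. records that
# were re-keyed or removed and replaced.
churned = issued_ids - added_records
share = churned / total_records

print(f"{churned / 1e6:.1f}M identifiers, {share:.1%} of the total")
```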

Examples 1 and 3 can be argued to be examples of a different thing (preserved specimens) than species occurrences ("Occurrence", the target of the Occurrence ID). These particular specimens are arguably tokens of evidence for species occurrences rather than occurrences themselves. Why not simply find better examples in line with example 2? --Dag Endresen (talk) 07:29, 23 February 2022 (UTC)

The property name "GBIF occurrence ID" looks very similar to the Darwin Core term "occurrenceID". To me, values such as those given in the examples (2350455454, 1655936481, 2831935646) would be better represented by a Wikidata property name such as "GBIF occurrenceKey" (see this link for more information), so that it is not confused with the more widely known "Darwin Core occurrenceID". In my opinion, two new Wikidata properties for the actual "Darwin Core occurrenceID" and the actual "Darwin Core materialSampleID" would be even more useful than the proposed "GBIF occurrence ID" property. --Dag Endresen (talk) 19:32, 26 August 2022 (UTC)

I agree that the current proposed name can be misleading. Most commonly, it seems to be referred to as either an occurrence key or as a gbifID. Could we list both clearly for this property?
I'm not so sure it is less useful than the Darwin Core dwc:occurrenceID and dwc:materialSampleID properties: for both of these, the format and persistence are effectively left to data publishers to decide. With the integer keys provided by GBIF, there is at least a system, a single data model and a single policy behind them, and it seems massive gains have been made lately in their reliability and persistence. Maybe @Timrobertson100 could provide updated numbers? Matdillen (talk) 14:33, 12 May 2023 (UTC)