Wikidata talk:WikiProject Biodiversity/Agassiz urchin fossil cast collection import


Agassiz urchin fossil cast collection import project
Import of structured data and pictures of Collection of sea urchin fossils casts created by Louis Agassiz (Q121092336) into Wikidata and Wikimedia Commons

Institution: Natural history museum of Neuchâtel (Q3330885)

Commissioned by: Wikimedia CH (Q15279140) (Contact: Flor WMCH)

Contractors: Luca Martinelli (user:Morpiz) and Léa Lacroix (user:Auregann)

Timeframe: July-December 2023


Hello everyone, we have a few questions related to the preparation of the museum's urchin fossil casts dataset and its reconciliation in OpenRefine. As we are not experts in modeling fossils in Wikidata, your feedback would be more than welcome :)

  1. In the data file, in addition to the scientific name, we have a column "scientific name authorship". How do you recommend modeling it on Wikidata?
  2. We also have a column "old determination", which often contains a scientific name different from the main one. The museum indicated that it's either a historic identification error, or that the species changed name over time, or even both, but the museum doesn't know which is the case for each fossil. What's the best way to proceed with it? Can we find a standard way to model it?
  3. We need to understand how to model types. We have holotypes, historic holotypes, syntypes, historic syntypes, illustrated, mentioned, or other type. Would you have examples of how to model each of those cases?

Many thanks in advance! Auregann (talk) 11:43, 2 August 2023 (UTC)

────────────────────────────────────────────────────────────────────────────────────────────────────

There are three things to model: the object, the specimen of which it is a cast, and the taxon. Most of what you mention above is about the taxon. If the cast is made from a specimen that was a type (of any of the above kinds), that is data about the specimen. So you have [cast] -> is a cast of [specimen] -> is a type or example of [taxon].

I'm not clear what you mean by "illustrated" or "mentioned".

You may also find Wikispecies' glossary useful. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 15:31, 2 August 2023 (UTC)

────────────────────────────────────────────────────────────────────────────────────────────────────

Thanks a lot @Pigsonthewing: for your answers!

  • Object vs specimen: do you mean that we should have two different items, one for the object (cast) and one for the specimen? In the dataset that we were provided, both things are merged. Is it really worth creating twice as many items?
  • Here are the descriptions from the museum about illustrated and mentioned:
    • Illustrated: The sample is illustrated (with an image) in a scientific publication without being a type.
    • Mentioned: The sample is mentioned in a publication (without image).

I hope that makes it clearer. Auregann (talk) 16:21, 2 August 2023 (UTC)

Yes, two items, one for the object (cast), one for specimen. Consider the (hypothetical?) case of a specimen, from which two or more museums each have a cast.
Or, by way of analogy, imagine you had been given a data set about portrait paintings, which included the birth and death dates of the subjects.
In the case of illustrated and mentioned, those would be statements on the items about the publication. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 16:28, 2 August 2023 (UTC)
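
A minimal sketch of the three-item model described above (cast -> specimen -> taxon), using plain Python objects rather than real Wikidata items. The statement keys `cast_of` and `type_or_example_of` and all labels are hypothetical placeholders, since the actual properties were still being discussed at this point in the thread.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """Simplified stand-in for a Wikidata item: a label plus statements."""
    label: str
    statements: dict = field(default_factory=dict)


# Hypothetical example data; the statement keys are placeholders, not property names.
taxon = Item("example taxon")
specimen = Item("specimen of example taxon", {"type_or_example_of": taxon})

# Two museums can each hold a cast of the same specimen without duplicating
# the specimen-level data (type status, locality, period, and so on).
cast_a = Item("cast held by museum A", {"cast_of": specimen})
cast_b = Item("cast held by museum B", {"cast_of": specimen})


def taxon_of_cast(cast: Item) -> Item:
    """Follow the chain cast -> specimen -> taxon."""
    return cast.statements["cast_of"].statements["type_or_example_of"]
```

Keeping the specimen as its own item means that anything said about the specimen is stated once, however many casts point to it.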

────────────────────────────────────────────────────────────────────────────────────────────────────

Hi @Pigsonthewing:, many thanks for your reply!

In an attempt to better visualize the data that we have and how to model it, we created this drawing. Could you have a look and let me know if it makes sense and if our proposed statements are correct?

While working on it, we came up with a few more questions:

  1. What’s the right property to connect the cast and the specimen?
  2. Where do we put class, order and family? Do they go on the specimen or on the taxon? And which properties should we use?
  3. What should we do if the class, order and family of the taxon mentioned in the dataset differ from what’s on Wikidata?
  4. I'm still not sure how to model the field "old determination" that we have in the dataset. Since we don't have any information about the context (renamed species, identification error), we would like to find a generic property to add on the specimen item. What do you think?
  5. About illustrated and mentioned: You suggest adding statements on the publication item, but shouldn’t we put a "cited in" statement on the specimen item instead? (To see all the specimens cited in a publication, we can easily create a SPARQL query.)

Many thanks again for your precious help! Auregann (talk) 17:11, 3 August 2023 (UTC)
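
As a rough illustration of the SPARQL query mentioned in question 5, the sketch below builds a query string listing every item that points to a given publication. It assumes the link is stored on the specimen item with described by source (P1343), which is only one possible choice; the helper name `specimens_described_by` is hypothetical, and the query is built but not executed here.

```python
def specimens_described_by(publication_qid: str, prop: str = "P1343") -> str:
    """Build a SPARQL query listing items linked to a publication via `prop`.

    `prop` defaults to P1343 (described by source); this is an assumption,
    not a settled modelling decision.
    """
    if not (publication_qid.startswith("Q") and publication_qid[1:].isdigit()):
        raise ValueError(f"not a Q-ID: {publication_qid!r}")
    return (
        "SELECT ?specimen ?specimenLabel WHERE {\n"
        f"  ?specimen wdt:{prop} wd:{publication_qid} .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}"
    )


# Q122453997 is one of the catalogue items cited in this project.
query = specimens_described_by("Q122453997")
print(query)
```

The resulting text can be pasted into the Wikidata Query Service. With only a handful of publications in the dataset, the reverse direction (listing specimens on each publication item) would also stay manageable.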

────────────────────────────────────────────────────────────────────────────────────────────────────

To answer your Qs first:

  1. depicts (P180)
  2. On the item about the taxon - see WikiProject Taxonomy for examples and documentation
  3. A question for the Taxonomy wikiproject (I'm sure examples exist, but can't think of any right now)
  4. Ditto
  5. You could do, but for some that could be a very long list

Your diagram is generally very good. You can add media legend as a qualifier to pictures of casts. Maybe you can also add "inception" dates? Under "Specimen", be aware that "discoverer or inventor" is often not (but can be) the "Scientific name authorship name"; likewise "time of discovery", which may be years or decades before the specimen is identified or named. It would be good to be more specific, if you can, under "label".

Don't forget, also, that your images of casts can be added, if needed, to items about specimens and taxa (including the higher taxon ranks), with suitable media legends. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 19:41, 3 August 2023 (UTC)

────────────────────────────────────────────────────────────────────────────────────────────────────

Thanks @Pigsonthewing:, once again your answers are very helpful!

I asked our questions on the Taxonomy project. To be honest, we don't necessarily want to make a lot of edits to taxon items and mess with the carefully curated dataset on Wikidata; we will simply try to reconcile the specimen data from the museum with Wikidata taxa as well as we can.

For the list of publications, we have 4 publications in the dataset, so it's going to be fine. We noted the media legend for pictures on the specimen items. Unfortunately, we don't have inception dates (in that case, the date on which the cast was created).

For the scientific authorship name and date, since we have only one column in the museum's file, and it is not necessarily structured well enough to extract the information, we are thinking of following the example of holotype of Ouratea sipaliwiniensis (Q55200035) and adding discoverer or inventor (P61): unknown value, with the qualifier object named as (P1932): "(Goldfuss, 1826)". Would that work? If I understood correctly, the date mentioned in this field is about when the scientist named/identified the specimen, but you are right, the time of discovery of the fossil is something different, and unfortunately we don't have that information.

We will definitely add a picture to the specimen items we create, but I think we will let editors who are more expert than us in taxonomy judge whether it is relevant to add them to some taxon items. Auregann (talk) 15:40, 5 August 2023 (UTC)
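
For the single "scientific name authorship" column, a small parser might recover author and year where the value follows the usual "(Author, Year)" shape. This is only a sketch under that assumption: the museum's actual formatting is not known here, the names `Authorship` and `parse_authorship` are hypothetical, and anything that does not match returns `None` so it can be kept as free text instead.

```python
import re
from typing import NamedTuple, Optional


class Authorship(NamedTuple):
    author: str
    year: Optional[int]
    parenthesized: bool  # kept because the brackets carry meaning in zoological names


# Optional brackets, an author part without commas or brackets, an optional 4-digit year.
_PATTERN = re.compile(
    r"^\(?\s*(?P<author>[^,()]+?)\s*(?:,\s*(?P<year>\d{4}))?\s*\)?$"
)


def parse_authorship(raw: str) -> Optional[Authorship]:
    """Split a value such as "(Goldfuss, 1826)" into author and year."""
    text = raw.strip()
    match = _PATTERN.match(text)
    if match is None:
        return None
    year = match.group("year")
    return Authorship(
        author=match.group("author"),
        year=int(year) if year else None,
        parenthesized=text.startswith("(") and text.endswith(")"),
    )


print(parse_authorship("(Goldfuss, 1826)"))
```

Values the pattern cannot split cleanly are probably best left in a string qualifier such as object named as (P1932) rather than guessed at.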

All of these suggestions seem sensible and reasonable, except for your comparison with Q55200035; in that case the statement and qualifier refer to the specimen, not a cast. I think the best thing you could do at this stage is upload one example of a cast image, and create the items about that and for its specimen; then we can see which properties fit and gain consensus. You can put all your data on the talk page(s), as text. Also CC User:Ambrosia10 and User:Dshorthouse, who each have experience of modelling biological specimens. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 18:29, 5 August 2023 (UTC)
Thanks! Yes, we plan to create an example as soon as we have received answers to our last questions from the museum. I'll keep you updated here :) Auregann (talk) 09:07, 9 August 2023 (UTC)

Test the modelling: example of files and items

Hello @Pigsonthewing, @Ambrosia10 and all,

Following your recommendation, we took a small sample in our dataset and created a test, so you can check it and let us know if the use of the properties and the modelling meet the community standards.

Here are the two items we created: Pygorhytis ringens (FOS-2117) (Q122459532) and specimen of Pygorhytis ringens (Q122459533). You can find the pictures in the related category on Commons. Note that we added some structured data, but not yet a Wikidata-powered template; we'll work on it as soon as the data structure is confirmed. What do you think? Is there anything in the modelling we should improve?

While doing the test, we also stumbled upon a few more issues:

  1. We couldn't find an item about fossil cast, to add as a value of "instance of" on the cast item. Do you know of one we could use? If not, should we create one with instance of (P31): cast (Q12042203)?
  2. On specimen of Pygorhytis ringens (Q122459533) we wanted to put a shortlist of the publications in which the fossil is cited, but we couldn't find the right property. "Stated in" should not be used in a statement. Do you have another suggestion? Or maybe having the publications as references of "instance of" -> holotype (or specimen) is enough?
  3. Could you explain again why we need to use discoverer or inventor (P61): unknown value, with the qualifier object named as (P1932): "(Agassiz, 1836)"? Is that because we're not sure who the inventor is, based on the data?

While continuing to match the full dataset in OpenRefine, we also came up with some questions:

  1. More than 500 of the species present in the Museum's dataset cannot be matched with any Wikidata item. You can see them in this document, with the number of occurrences in the dataset. We're not sure what to do with those. Should we batch create the taxon items?
  2. In some cases, the species is indicated with a question mark at the end. According to the Museum, it means that they are not 100% sure it's the right species. How do you suggest we model this? Is there for example some qualifier that would express the degree of uncertainty, or should we only put "unknown value"?

Many thanks in advance for your answers! Best, Luca & Auregann (talk) 06:46, 14 September 2023 (UTC)

────────────────────────────────────────────────────────────────────────────────────────────────────

The two items look good to me. The only issue I can see is on Q122459533, where you have

discoverer or inventor: unknown value; object named as - (Agassiz, 1836)

Is that a reference to Louis Agassiz (Q122972)?

In answer to your first three questions (which I have numbered for convenience):

  1. Create one, but make it "subclass of"
  2. The latter is enough, in my view
  3. This is relevant to my above concern; you don't need to use the qualifier. If Agassiz found the specimen, use his item - if not, just "unknown value".

In answer to your second set of two questions (also numbered):

  1. I would create items for these, though anticipate that some may be duplicates due to typographical or transcription errors. Taking one random example, Caryocrinus ornatus, Google finds hits for the taxon. In an ideal world you'd check all 500, but I realise that may be impractical. Perhaps a sample of, say, 50 could be checked? At the least, use your project as a source, so the set can be queried.
  2. Yes, you can use sourcing circumstances (P1480) with a suitable value.

-- Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 15:18, 14 September 2023 (UTC)
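
A small sketch of how the trailing question mark could be turned into a sourcing circumstances (P1480) qualifier during data preparation. The dictionary layout is a simplified stand-in for a statement, and `QID_FOR_UNCERTAINTY` is a deliberate placeholder: the thread does not say which item should be used as the "suitable value".

```python
from typing import Tuple


def split_uncertain_name(raw: str) -> Tuple[str, bool]:
    """Return the cleaned name and whether it carried a trailing '?'."""
    name = raw.strip()
    uncertain = name.endswith("?")
    if uncertain:
        name = name.rstrip("?").strip()
    return name, uncertain


def build_taxon_statement(raw: str, uncertainty_value: str = "QID_FOR_UNCERTAINTY") -> dict:
    """Build a simplified statement, adding P1480 only for uncertain names."""
    name, uncertain = split_uncertain_name(raw)
    statement = {"value": name, "qualifiers": {}}
    if uncertain:
        # P1480 = sourcing circumstances; the value is a placeholder to be chosen.
        statement["qualifiers"]["P1480"] = uncertainty_value
    return statement


print(build_taxon_statement("Pygorhytis ringens ?"))
```

Stripping the question mark before reconciliation should also help matching in OpenRefine, since the cleaned name is what exists on Wikidata.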

@Pigsonthewing hey, I went ahead and created the missing taxa. This is just one example, Hemipatagus hofmanni (Q122966461), which I modeled after Bald Eagle (Q127216). I can assure you that almost all the taxa I created have some sort of backup in relevant databases, and I did my best to include all the data that the Neuchâtel Museum gave us. You can check all the recent uploads at Special:Contributions/Morpiz. I hope I did everything right, but please let me know if I have to change/correct anything. Thanks! Morpiz (talk) 14:09, 11 October 2023 (UTC)
@Morpiz: I checked a few. Generally they look good, but on Clypeus solodurinus (Q122967060) you have the authority as "Louis Agassiz (Q122972), 1836", whereas [1] has it as "Wright, 1852". Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 20:30, 11 October 2023 (UTC)
@Pigsonthewing Thanks. I don't know what to do with this discrepancy in the data. How do you usually model conflicting information? Morpiz (talk) 09:34, 12 October 2023 (UTC)
@Morpiz: Add both and mark one as preferred, or the other as deprecated, with a qualifying reason. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 20:06, 12 October 2023 (UTC)
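
To make the effect of ranks concrete, here is a minimal sketch of how two conflicting authority statements behave once one is marked preferred or deprecated. The `Statement` class is a simplified stand-in, and which of the two values ends up preferred is chosen arbitrarily for illustration; that decision belongs to the editors.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Statement:
    value: str
    rank: str = "normal"          # "preferred", "normal" or "deprecated"
    reason: Optional[str] = None  # stands in for a "reason for ..." qualifier


def best_statements(statements: List[Statement]) -> List[Statement]:
    """Return preferred statements if any exist, otherwise the normal ones.

    Deprecated statements are never returned, but they stay on the item.
    """
    preferred = [s for s in statements if s.rank == "preferred"]
    if preferred:
        return preferred
    return [s for s in statements if s.rank == "normal"]


# Option 1: keep both and mark one as preferred (arbitrary choice for the example).
option_1 = [
    Statement("Agassiz, 1836"),
    Statement("Wright, 1852", rank="preferred", reason="placeholder reason"),
]

# Option 2: keep both and deprecate one instead (again an arbitrary choice).
option_2 = [
    Statement("Agassiz, 1836", rank="deprecated", reason="placeholder reason"),
    Statement("Wright, 1852"),
]

print([s.value for s in best_statements(option_1)])
print([s.value for s in best_statements(option_2)])
```

Either way both values remain visible on the item with their sources, and only the selected one is returned by queries that ask for the best-ranked statements.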

Commons template

Hello all,

Now that we have reviewed and validated the data model, we would like to create a nice template for the Commons file pages that takes its data from the cast and specimen items. @Jarekt generously agreed to help us and create a template based on Artwork.

This template will be fed with two variables: the Q-ID of the cast item (for example Pygorhytis ringens (FOS-2117) (Q122459532)) and the Q-ID of the specimen item (for example specimen of Pygorhytis ringens (Q122459533)). The template can be tested on c:File:Pygorhytis ringens FOS 2117 - 1.jpg.

{| class="wikitable"
! Template label !! Template value !! Comment !! Example !! Status
|-
| Title (at the top of the template box)
| File name without the prefix and suffix, linking to the cast item on Wikidata
|
| Pygorhytis ringens (FOS-2117)
|
|-
| Type status
| Value of the statement instance of (P31) on the specimen item
| "Type status" is not a Wikidata property, we'll need to provide translations.
| holotype (Q1061403)
| still need label
|-
| Type specimen
| Value of of (P642), a qualifier of the statement instance of (P31) on the specimen item, in italics, plus the value of discoverer or inventor (P61) of the specimen and the value of the qualifier time of discovery or invention (P575), in brackets
| Translations can be taken from type specimen (Q51255340)
| Pygorhytis ringens (Louis Agassiz, 1836)
| to update
|-
| Family
| Starting from the specimen value (see line above), climbing the ontology through parent taxon (P171) until we find the one that has taxon rank (P105) -> family (Q35409)
| Translations can be taken from family (Q35409)
| Pygorhytidae (Q22348453)
| added to Type specimen field
|-
| Class
| Starting from the specimen value, climbing the ontology through parent taxon (P171) until we find the one that has taxon rank (P105) -> class (Q37517)
| Translations can be taken from class (Q37517)
| sea urchin (Q83483)
| to add to type specimen field
|-
| Order
| Starting from the specimen value, climbing the ontology through parent taxon (P171) until we find the one that has taxon rank (P105) -> order (Q36602), if it exists. If none is entered, leave blank.
| Translations can be taken from order (Q36602)
| ?
| to add to type specimen field
|-
| inventory number (P217)
| Content of the statement inventory number (P217) on the cast item
| The collection value could also be in a separate field if it's easier.
| FOS-2117 (Collection of sea urchin fossils casts created by Louis Agassiz (Q121092336))
|
|-
| Source
| Value of owned by (P127) on the collection item, linked from collection (P195) on the cast item
| Similar to the existing source field on Artwork
| Natural history museum of Neuchâtel (Q3330885)
|
|-
| Age
| Value of time period (P2348) on the specimen item, displaying age if available
| Translations can be taken from age (Q568683). This value may be blank if the value of time period is a higher rank.
| Bajocian (Q375180)
|
|-
| Epoch
| Starting from value of time period (P2348) on the specimen item, displaying epoch if available
| Translations can be taken from epoch (Q754897). We may have to climb in the ontology of the age to find this value.
| Middle Jurassic (Q500054)
|
|-
| Period
| Starting from value of time period (P2348) on the specimen item, displaying period if available
| Translations can be taken from period (Q392928). We may have to climb in the ontology of the age to find this value.
| Jurassic (Q45805)
|
|-
| location of discovery (P189)
| Value of location of discovery (P189) on the specimen item, if available
|
| Solothurn (Q11929)
|
|-
| Country
| Starting from value of location of discovery (P189) on the specimen item, displaying country if available
| Translations can be taken from country (Q6256). We may have to climb in the ontology of the location to find this value.
| Switzerland (Q39)
| added to location of discovery field
|-
| discoverer or inventor (P61)
| Value of discoverer or inventor (P61) on the specimen item
| Value only, the qualifier will be in the field below
| Louis Agassiz (Q122972)
|
|-
| time of discovery or invention (P575)
| Value of time of discovery or invention (P575), a qualifier of discoverer or inventor (P61) on the specimen item
|
| 1836
|
|-
| Bibliographical references
| Values of described by source (P1343) on the specimen item
| "Bibliographical references" is not a Wikidata property, we'll need to provide translations.
|
* Nouveau catalogue des moules d'echinides fossiles du musée d'histoire naturelle de Neuchâtel (Q122453997) (page 118)
* Catalogus systematicus ectyporum echinodermatum fossilium Musei Neocomensis : secundum ordinem zoologicum dispositus (Q56656203) (page 2)
* Catalogue raisonné des Echinides. Catalogue raisonné des espèces, des genres et des familles d'echinides (Q122455168) (page 33)
* Synopsis des échinides fossiles (Q51376230) (page 207)
| uses "references" field
|-
| Permission
| {{Muséum d'histoire naturelle de Neuchâtel}}
| Identical to Artwork
|
|
|}

If you have any suggestions or questions, or if anything is unclear, please let me know. Thanks again for your help! Luca & Auregann (talk) 15:34, 4 October 2023 (UTC)
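
The "climbing the ontology" rule used for the Family, Class and Order rows can be sketched as a simple upward walk. This is an illustration in Python over a toy in-memory graph, not the Commons Lua module; the item names and the `TAXA` table are hypothetical stand-ins for parent taxon (P171) and taxon rank (P105) lookups.

```python
from typing import Dict, Optional

# Toy graph standing in for P105 (rank) and P171 (parent) statements.
# There is deliberately no "order" level, as can happen on Wikidata.
TAXA: Dict[str, dict] = {
    "species-x": {"rank": "species", "parent": "genus-x"},
    "genus-x": {"rank": "genus", "parent": "family-x"},
    "family-x": {"rank": "family", "parent": "class-x"},
    "class-x": {"rank": "class", "parent": None},
}


def find_ancestor_with_rank(
    start: str,
    wanted_rank: str,
    taxa: Dict[str, dict] = TAXA,
    max_depth: int = 50,
) -> Optional[str]:
    """Walk up the parent chain until an ancestor with the wanted rank is found.

    Returns None when the chain ends, loops, or leaves the known data,
    so that the corresponding template field can simply stay blank.
    """
    current = taxa.get(start, {}).get("parent")
    seen = set()
    for _ in range(max_depth):
        if current is None or current in seen or current not in taxa:
            return None
        if taxa[current]["rank"] == wanted_rank:
            return current
        seen.add(current)
        current = taxa[current]["parent"]
    return None


print(find_ancestor_with_rank("species-x", "family"))
print(find_ancestor_with_rank("species-x", "order"))
```

The same walk would apply to the Age, Epoch and Period rows and to the Country row, with a different "parent" relation in each case. The depth limit and the `seen` set guard against cycles in the data.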

@Jarekt: Update: I added/edited a few lines based on the Museum's request. If anything is unclear, let me know. Thanks, Auregann (talk) 13:56, 6 October 2023 (UTC)
@Auregann: I began writing c:Module:Custom fossil, and I should have something working in a couple of days. The code I am writing will be mostly applicable to your data collection, and I am not sure what to call the template. Any suggestions? --Jarekt (talk) 01:42, 16 October 2023 (UTC)
I do not think the content of "Bibliographical references" should come from statement references. I think that in addition to statement references you should also have them in described by source (P1343) or published in (P1433). --Jarekt (talk) 03:21, 16 October 2023 (UTC)
@Jarekt: I agree that it's pretty specific to this project and the requirements of this specific museum. How about "Template:Neuchâtel fossil cast"?
Noted. We will look at how we can adapt the data model. Auregann (talk) 08:23, 16 October 2023 (UTC)
@Jarekt: Alright, we added described by source (P1343) to the specimen item, so you can take the bibliographical references from there. Does that work for you? Auregann (talk) 07:39, 17 October 2023 (UTC)

@Auregann:, I have an initial version of c:Template:Neuchâtel fossil cast with some testing at c:Module_talk:Neuchâtel_fossil_cast/testcases --Jarekt (talk) 03:46, 19 October 2023 (UTC)

Hi @Jarekt:, thank you so much! I see that there are two different versions of the template, one with the cast Q-ID and one with the specimen Q-ID. What we intended was to combine both in one template, because in the end on the Commons file page we need to display data about both things.
The code for the template would look more like:
 {{#invoke:Neuchâtel fossil cast|fossil
 |specimen=Q122459533
 |cast=Q122459532
 |lang=en
 |permission={{Muséum d'histoire naturelle de Neuchâtel}}
 }}

And would combine data from both in one table. Does that make sense?
Thank you so much again for your work! Auregann (talk) 09:53, 20 October 2023 (UTC)
Another thing I noticed: for data such as the time period, you only included the age (Bajocian). Although this makes sense from a Wikidata perspective (we only indicate the most precise level of information we have), the paleontologists have guidelines about how to present the data, and they want all the levels (age, epoch, period) displayed below the picture of the fossil. Same for the location and the species/family. Do you think you could make it happen exactly the way I presented it in the table above? Thanks a lot! Auregann (talk) 10:29, 20 October 2023 (UTC)
@Auregann:, I missed the nuance about information from 2 Wikidata items, but it does make sense. It should not be hard to modify. --Jarekt (talk) 02:49, 21 October 2023 (UTC)
@Jarekt: Many thanks for your work! <3 The template looks really fancy now.
How can we make the epoch and period appear? I tried to look at the code and see how you did it for the family and the country, but couldn't figure it out ^^
For Type Status, we can already enter a translation in a few languages. Where should we store it? Auregann (talk) 06:46, 23 October 2023 (UTC)
@Auregann:, Type Status translations can be found at c:Module:I18n/artwork. Is that a biological or archaeological term? An Internet search shows that it means a lot of different things to a lot of people, and I do not know how to translate it into Polish. As for epoch and period, they do seem to work; I just put them in a single field. File:Pygorhytis ringens FOS 2117 - 1.jpg also shows an issue with the additional "description" field. I can fix it if you are planning to use it, but if not then I would leave it as is. --Jarekt (talk) 00:43, 1 November 2023 (UTC)

Hello @Jarekt:, many thanks for building this great template. We received very positive feedback from people at Wikimedia CH involved in several data import projects with the Neuchâtel museum!

We also received some requests for improvements from the Museum. We are of course trying to find a good compromise between the way the Wikidata and Commons communities like to organize and display the data, and the norms that scientists have used for decades or centuries to describe information :)

  1. The main point that the museum asked us to change is the field "Type specimen". They would like to have the name of the species in italics, and to have the name of the discoverer and the date of discovery afterwards in brackets, to get it as close as possible to the scientific norm, which would be Pygorhytis ringens (Agassiz, 1836). I updated the table with a description and example here.
  2. The second thing they would like is to add the class and order on top of the family, in a similar way to what you did for the age, epoch and period. I also added it to the table. I'm aware that "climbing the ontology" is tedious to do, and I noticed with our example that sometimes there's no order in the Wikidata ontology, which is fine; we can leave it blank.

Could you make these two edits to the template in the upcoming days?

Thank you so much for your help on this project, we really appreciate the time you spent tweaking the template to bring the best of the Wikimedia and museum worlds together :) Auregann (talk) 13:31, 14 November 2023 (UTC)
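
The requested "Type specimen" layout (species name in italics, then discoverer and date in brackets) amounts to a small formatting rule. The sketch below is in Python for illustration only, not the Lua used by the Commons module; the function name is hypothetical and wikitext italics (two apostrophes) are assumed as the output markup.

```python
from typing import Optional


def format_type_specimen(
    taxon_name: str,
    discoverer: Optional[str] = None,
    year: Optional[int] = None,
    italic_markup: str = "''",
) -> str:
    """Format a taxon name in italics, followed by "(discoverer, year)".

    Missing parts are skipped, so a specimen without a year or without a
    discoverer still renders without empty brackets or stray commas.
    """
    text = f"{italic_markup}{taxon_name}{italic_markup}"
    details = ", ".join(str(part) for part in (discoverer, year) if part)
    return f"{text} ({details})" if details else text


print(format_type_specimen("Pygorhytis ringens", "Agassiz", 1836))
```

Whether the discoverer is shown as "Agassiz" or "Louis Agassiz" depends on which label is passed in; the scientific norm quoted above uses the surname only.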

I will work on implementing the suggestions. Jarekt (talk) 04:41, 16 November 2023 (UTC)
Hello @Jarekt:,
As we're starting to import the files on Commons, we encountered an issue on some files, for example c:File:Acrocidaris formosa FOS 2626 - 1.JPG. There's a Lua error instead of the template, which may be caused by the fact that the "family" is not present in the taxon ontology. We are not able to fix the data for now, as we do not have the necessary information from the museum.
Could you look at the error? If it is indeed caused by missing data, could you implement an error catch that would simply not display the part about family if this data is not found? I guess this should be implemented as well for the other parts of the template where it browses the ontology to find something.
Many thanks in advance, Auregann (talk) 13:52, 30 November 2023 (UTC)
Auregann, will do, and I still have to get started with implementing the suggestions. Jarekt (talk) 14:25, 30 November 2023 (UTC)
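
The requested "error catch" can be pictured as follows: each field is looked up through a guard, and a row is only emitted when a value was actually found. This is a Python sketch of the idea with hypothetical names and example data, not the module's code; in a Lua module the analogous guard would typically be `pcall` or an explicit nil check.

```python
from typing import Callable, List, Optional, Tuple


def safe_lookup(getter: Callable[[], Optional[str]]) -> Optional[str]:
    """Run a lookup and turn missing-data errors into None."""
    try:
        return getter()
    except (KeyError, TypeError, AttributeError):
        return None


def render_rows(fields: List[Tuple[str, Callable[[], Optional[str]]]]) -> List[str]:
    """Render "label: value" rows, silently skipping fields with no data."""
    rows = []
    for label, getter in fields:
        value = safe_lookup(getter)
        if value:
            rows.append(f"{label}: {value}")
    return rows


# Hypothetical record in which the family could not be found.
record = {"species": "Acrocidaris formosa"}
fields = [
    ("Species", lambda: record["species"]),
    ("Family", lambda: record["family"]),  # raises KeyError, so the row is skipped
]
print(render_rows(fields))
```

Applying the same guard to every field that walks the ontology (family, class, order, epoch, period, country) would keep one missing statement from breaking the whole template.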