Wikidata:WikiProject Biodiversity/Agassiz urchin fossil cast collection import


Agassiz urchin fossil cast collection import project
Import of structured data and pictures of the collection of sea urchin fossil casts created by Louis Agassiz (Q121092336) into Wikidata and Wikimedia Commons

Institution: Natural history museum of Neuchâtel (Q3330885)

Commissioned by: Wikimedia CH (Q15279140) (Contact: Flor WMCH)

Contractors: Luca Martinelli (user:Morpiz) and Léa Lacroix (user:Auregann)

Timeframe: July-December 2023


Project summary

The Museum of Natural History in Neuchâtel wanted to import a set of pictures of 664 urchin fossil casts, together with their metadata, to the Wikimedia projects. As part of this project, we imported the pictures to Wikimedia Commons along with the related metadata, stored in a linked data format and connected appropriately to newly created Wikidata items. Beyond the file import on Commons, this project required analyzing, cleaning and reconciling the data against existing content on Wikidata, with input from the museum; creating missing entries on Wikidata (specific entries for the fossils, entries about the species, bibliographical references); and improving the existing paleontology data model on Wikidata where needed.

The project was commissioned by Wikimedia CH and took place from July to December 2023.

  • Data and files import, OpenRefine: Luca Martinelli (user:Morpiz)
  • Coordination, contact with the museum, documentation: Léa Lacroix (user:Auregann)
  • Contact at Wikimedia CH: Flor Méchain (User:Flor WMCH)

The project took place in several phases:

 
[Image: Conoclypus anachoreta FOS 2440 - 1]
  • Analyze, clean and refine the data to get it ready for import to Wikidata and Wikimedia Commons (structured data).   Done
  • Analyze and improve the existing content on Wikidata. Create new entries (fossils, species, bibliographical references) to enrich the data. Improve the data model on paleontology if required.   Done
  • Communicate with the Museum to transmit questions, issues with the data and requests for clarification.   Done
  • Provide a test sample for validation by the museum: cast item and file on Commons with structured data.   Done
  • Create a Commons template that would pull and display data from Wikidata, according to requirements from the Museum.   Done
  • Import the previously cleaned and refined content: files on Wikimedia Commons, and the related structured metadata on Wikimedia Commons and Wikidata.   Done
  • Prepare visualizations to give an overview on the imported content and allow monitoring and maintenance.   Done
  • Deliver the documentation of the process with a description of the various steps of the project.   Done

Files on Commons

Part of the project took place on Wikimedia Commons, with the import of files and metadata and the creation of a Wikidata-powered template for fossils.

Queries and visualisation

 
[Image: Map of specimens' discovery places, color-coded by geological period]
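
The map above is produced with a SPARQL query against the Wikidata Query Service. As a starting point for similar visualizations, here is a minimal Python sketch of such a query. The properties used (collection (P195), location of discovery (P189), coordinate location (P625)) are assumptions about the data model, so adapt them to the actual statements on the cast items.

  import requests

  WDQS = "https://query.wikidata.org/sparql"

  # Assumed model: casts belong to the Agassiz collection (Q121092336)
  # via collection (P195) and have a location of discovery (P189)
  # that carries coordinates (P625).
  QUERY = """
  SELECT ?cast ?castLabel ?place ?coords WHERE {
    ?cast wdt:P195 wd:Q121092336 ;
          wdt:P189 ?place .
    ?place wdt:P625 ?coords .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en,fr". }
  }
  """

  response = requests.get(
      WDQS,
      params={"query": QUERY, "format": "json"},
      headers={"User-Agent": "fossil-cast-map-example/0.1"},
      timeout=60,
  )
  response.raise_for_status()
  for row in response.json()["results"]["bindings"]:
      print(row["castLabel"]["value"], "-", row["coords"]["value"])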

Documentation

In this section, we go into more detail about how we analyzed, refined and imported the content, in an attempt to provide tips and advice to people who will work on a similar import project in the future. Prerequisites: we mostly used OpenRefine to clean, refine and import the data and files. This section does not cover the basics of using OpenRefine, but you can find a short video presentation of the tool here, as well as a detailed tutorial here. We also found this presentation focusing on importing files to Commons directly from OpenRefine very useful.

 
[Image: Cyathocidaris avenionensis FOS 2658 - 1]

Analyze, clean and refine the data

Cleaning up and reconciling the data was by far the biggest part of the project. Here are a number of takeaways from this phase:

  • Check data for completeness
    • Places of discovery and the age to which the fossils are dated weren’t always included in the first batch of data, so we asked the museum to complete it. This allowed us to recover all the temporal data, and almost all places of discovery.
    • Only when there is absolutely no way to trace this data should you enter “unknown” as the value.
  • Double check the data you’re reconciling
    • Places of discovery in particular were tricky to reconcile, because the names were in French and/or contained small mistakes.
    • One solution is to reconcile data directly in the language it is written in (just add the reconciliation manifest link for the proper language), but a second round of checking against external sources is always preferable, especially for places and names (see the reconciliation sketch after this list).
    • Ask the original data provider for help if some disambiguation is needed: they (should) know their data better than anyone else.
  • Always consider Wikidata's notability criteria when doing the reconciliation
    • Some of the places of discovery were simply not notable enough for an item to be created, so we moved up to the next higher level of subdivision available (for example, “craie de Morée” was reconciled to Morée (Q389621))
  • Ask the relevant project and/or other users for help
    • If you have trouble deciding how to model certain aspects of your work, ask for help from other users. This will save you precious time.
    • There are also the Telegram channels for both Wikidata and OpenRefine in case you need help.
  • If data is split across several columns, try to consolidate it into one before reconciling
    • In other words, create a new column in OpenRefine and populate it with the data of the other columns. This can be done through “Edit column” → “Join columns”, selecting all columns that apply, and setting a new column for the result.
    • This step will save you time when reconciling data, since you’ll just have to clean up and reconcile one column instead of several.
    • This will also save you time when you upload the data: having just one column combination to go through, instead of six or seven, makes the process faster.
    • Do not remove the original columns. They can always be useful for disambiguation or to double check data.
    • This also works the other way round: if you need to split data, you can do so through “Edit column” → “Split into several columns”, setting the character(s) that will serve as separator and the names of the new columns for the results.
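
As a complement to reconciling in the OpenRefine interface, the sketch below shows how the same reconciliation service can be queried from Python. The endpoint shown (wikidata.reconci.link, the service OpenRefine uses for Wikidata) and the example place name and type are illustrative assumptions; in this project, all reconciliation was done inside OpenRefine.

  import json
  import requests

  # Wikidata reconciliation endpoint; the /fr/api path selects French
  # labels, which helps with French place names like those in this dataset.
  ENDPOINT = "https://wikidata.reconci.link/fr/api"

  def reconcile(name, type_qid=None, limit=5):
      """Send one reconciliation query and return the candidate matches."""
      query = {"query": name, "limit": limit}
      if type_qid:
          query["type"] = type_qid
      params = {"queries": json.dumps({"q0": query})}
      response = requests.get(ENDPOINT, params=params, timeout=30)
      response.raise_for_status()
      return response.json()["q0"]["result"]

  # Hypothetical example: a French place name, restricted to
  # human settlements (Q486972).
  for candidate in reconcile("Morée", type_qid="Q486972"):
      print(candidate["id"], candidate["name"], candidate.get("score"))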

Improve the existing content on Wikidata

We created 263 new items about species that were missing on Wikidata. This was, of course, a necessary step when reconciling the data about the fossils’ species. The same applies to the bibliographical references that were later included in the uploads as sources for the statements. (A scripted item-creation sketch follows the list below.)

Most of the takeaways from the previous section apply here, but there are a couple more that might be interesting:

  • If the data you’re working with is complex, split the work into several stages
    • For example, for this import we started by creating the missing items about species, references, types of fossils, and all the other items necessary to complete the reconciliation process; we then proceeded with a second stage related to specimens, and a third stage related to the actual fossil casts. Only once all the data was uploaded did we move on to uploading the pictures.
  • Follow the community guidelines about how to create new items
    • For example, if you want to create an item about a bibliographical reference, follow the guidelines in Wikidata:WikiProject Books
    • When in doubt, ask the community about how you should proceed.
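
For readers who script their imports instead of using OpenRefine, here is a minimal Pywikibot sketch of what creating a new species item could look like. This is an illustration under assumptions, not the method used in this project (which relied on OpenRefine): the label, description and taxon name are examples taken from one of the casts pictured above, and running it requires a configured, logged-in Pywikibot account.

  import pywikibot

  # Connect to Wikidata (consider test.wikidata.org for experiments).
  site = pywikibot.Site("wikidata", "wikidata")
  repo = site.data_repository()

  # Create a new, empty item and give it a label and description.
  item = pywikibot.ItemPage(repo)
  item.editEntity(
      {
          "labels": {"en": "Cyathocidaris avenionensis"},
          "descriptions": {"en": "species of sea urchin"},
      },
      summary="Creating a species item for a fossil cast import",
  )

  # instance of (P31) -> taxon (Q16521)
  claim = pywikibot.Claim(repo, "P31")
  claim.setTarget(pywikibot.ItemPage(repo, "Q16521"))
  item.addClaim(claim)

  # taxon name (P225) takes a plain string value
  name_claim = pywikibot.Claim(repo, "P225")
  name_claim.setTarget("Cyathocidaris avenionensis")
  item.addClaim(name_claim)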

Import the content on Wikimedia Commons and Wikidata

The import of data happened in several rounds, depending on the kind of data to be uploaded. Most of the takeaways from the previous sections apply here, but there are a couple more that might be interesting:

  • Remember to save a local copy of the data model on OpenRefine
    • You can do so by clicking on “Save new” at the end of the line that says “Start from an existing schema:”.
    • This is useful especially when you switch the Wikibase instance you’re working with (e.g. from Wikidata to Wikimedia Commons), since switching will clear your model.
  • Use direct upload through OpenRefine instead of exporting to QuickStatements
    • Direct upload is preferable if the user is not an administrator on Wikidata and/or Wikimedia Commons, for two main reasons:
      • It automatically reconciles the newly created items with the values in the table;
      • It supports the creation of statements with more than one source.
  • Be aware of potential rate limits imposed by the projects when uploading
    • If you reach the edit limit imposed by the system, do not interrupt the upload: otherwise OpenRefine won’t automatically reconcile the items it is creating, and you’ll be forced to do that by hand.
    • The limit for uploading files to Wikimedia Commons through OpenRefine is ~370 files every 72 minutes. If you plan on uploading more than that, split the upload into several batches of ~100–150 files each and account for 30-minute pauses every 2–3 batches (see the batching sketch below).
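
To illustrate the batching arithmetic described above, here is a hypothetical Python sketch that splits a list of files into batches and pauses between them. The upload_batch function is a placeholder for whatever upload mechanism you use, and the filenames are made up; the batch size and pause values come from the numbers above.

  import time

  BATCH_SIZE = 120            # ~100-150 files per batch, as suggested above
  PAUSE_SECONDS = 30 * 60     # 30-minute pause
  BATCHES_BETWEEN_PAUSES = 3  # pause every 2-3 batches

  def upload_batch(batch):
      """Placeholder: replace with your actual upload call."""
      print(f"Uploading {len(batch)} files...")

  def upload_in_batches(files):
      batches = [files[i:i + BATCH_SIZE] for i in range(0, len(files), BATCH_SIZE)]
      for n, batch in enumerate(batches, start=1):
          upload_batch(batch)
          # Pause regularly to stay under the ~370 files / 72 minutes limit.
          if n % BATCHES_BETWEEN_PAUSES == 0 and n < len(batches):
              time.sleep(PAUSE_SECONDS)

  # Hypothetical usage with 664 placeholder filenames (the size of this collection):
  upload_in_batches([f"FOS_{i:04d}.tif" for i in range(1, 665)])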

Discussions

Questions, suggestions, issues? Feel free to write on the talk page!