Wikidata:WikiProject Chemistry/ChemID

ChemID Initiative

Goals

The ChemID Initiative aims to compare the different free databases about chemicals in order to match their corresponding IDs, and to use Wikidata as the central hub connecting the databases' data sets.

Introduction

Different databases list chemicals according to different criteria, and each identifies the chemicals it holds by an internal identifier. These identifiers are often used outside their original databases to make it easier to identify the same chemical across databases.

Most databases try to integrate some identifiers from other databases in order to offer links between the data sets spread across them, but this is not done systematically: for example, database A adds the identifiers of databases B and C, database B integrates the identifiers of databases C and D, database C adds the identifiers of databases E, F and G, and so on.

Wikidata can serve as a central point of connection between the databases and the data they store by listing, in the item of each chemical, all of that chemical's database identifiers.

How to achieve this goal?

Most of the databases are free and open data (but we have to check each licence to see to what extent we can import the IDs) and even offer access to their data in standard machine-readable formats (XML, SDF, ...):

PubChem: data to download (Sebotic's bot code base can handle this)
ChEMBL: data to download
ChEBI: data to download (this needs discussion: e.g. how to deal with the many ions)
KEGG: data to download (not open data)
CompTox Chemicals Dashboard: data to download (public domain US government data)

These data sets contain different data for each chemical: internal identifier, identifiers of other databases, chemical formula, InChI, InChIKey, SMILES, CAS number, IUPAC name, ... All these data can be used to match the data sets of the databases and to find the entries that correspond to one and the same chemical.
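The matching idea can be illustrated with a minimal Python sketch. It assumes the downloads have already been parsed into simple records, and it uses the InChIKey as the only matching field. The field names (`source`, `id`, `inchikey`) and every value below are placeholders made up for illustration, not real database content.

```python
from collections import defaultdict

# Placeholder records standing in for rows parsed from database downloads.
# The field names and all values are made up for illustration.
RECORDS = [
    {"source": "PubChem", "id": "pc-1", "inchikey": "AAAAAAAAAAAAAA-AAAAAAAAAA-A"},
    {"source": "ChEBI", "id": "cb-7", "inchikey": "AAAAAAAAAAAAAA-AAAAAAAAAA-A"},
    {"source": "ChEMBL", "id": "cm-3", "inchikey": "BBBBBBBBBBBBBB-BBBBBBBBBB-B"},
    {"source": "ChEBI", "id": "cb-9", "inchikey": None},
]


def group_by_inchikey(records):
    """Return {inchikey: {source: sorted list of internal IDs}}."""
    table = defaultdict(lambda: defaultdict(set))
    for record in records:
        key = record.get("inchikey")
        if not key:
            # No InChIKey: such a record needs another matching field.
            continue
        table[key][record["source"]].add(record["id"])
    return {
        key: {source: sorted(ids) for source, ids in by_source.items()}
        for key, by_source in table.items()
    }


if __name__ == "__main__":
    for key, by_source in group_by_inchikey(RECORDS).items():
        print(key, by_source)
```

Records that lack an InChIKey are skipped here. A fuller routine would probably fall back on other fields from the list above (InChI, CAS number, SMILES) before giving up on a record.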

To avoid copyright problems, the data extraction and processing should be done outside of Wikidata. Only the final result should be loaded into WD, once the final comparison is done and in accordance with the copyright terms of each database.

Process definition

First step: Create a list of the Q numbers of the Wikidata items that define chemicals, together with one identifier. The best compromise is to use InChIKey/InChI as the primary ID because these IDs are derived from the chemical structure.
- This query extracts all items defined as instance of chemical compound AND having an InChIKey statement:
SELECT * WHERE { ?compound wdt:P31 wd:Q11173 ; wdt:P235 ?inchikey }
Be careful: this list contains some groups of chemicals that are defined as instance of chemical substance instead of subclass of chemical substance. This should be corrected later, during the ID comparison.

Be careful: chemicals that have an InChIKey statement but no instance of chemical compound statement are not in the list.

Status: curation is ongoing to remove duplicates (several items having the same InChIKey) and bad definitions (one item having several InChIKeys). See the items requiring curation here.
Second step: Download the IDs from the databases cited above and store all this data in a single, uniform format.
Status: Open
Third step: For each chemical, compare the IDs extracted from the databases. Each ID of each chemical has to be given an evaluation mark indicating whether the ID is the same in the different databases.
Status: Open
Fourth step: IDs with a low ranking should be curated using information from the external databases. The goal is to understand why different databases use different values for one ID of the same chemical.
Status: Open
Fifth step: IDs with a high ranking, as well as curated IDs, can be uploaded into WD.
Status: Open
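
The checks in the first step and the ranking in the third to fifth steps could look roughly like the following Python sketch. Nothing here is an agreed method. The evaluation mark is assumed to be the share of reporting databases that agree on the most common value, and the default threshold of 1.0 (full agreement) is likewise an assumption. All function names are hypothetical.

```python
from collections import Counter, defaultdict


def find_inchikey_conflicts(pairs):
    """pairs: iterable of (qid, inchikey) rows, e.g. exported from step one.

    Returns (InChIKeys shared by several items, items carrying several InChIKeys).
    """
    items_by_key = defaultdict(set)
    keys_by_item = defaultdict(set)
    for qid, inchikey in pairs:
        items_by_key[inchikey].add(qid)
        keys_by_item[qid].add(inchikey)
    duplicates = {k: sorted(v) for k, v in items_by_key.items() if len(v) > 1}
    multi_key = {q: sorted(v) for q, v in keys_by_item.items() if len(v) > 1}
    return duplicates, multi_key


def evaluation_mark(values_by_source):
    """Return (most common value, share of reporting sources that give it)."""
    reported = [v for v in values_by_source.values() if v is not None]
    if not reported:
        return None, 0.0
    value, count = Counter(reported).most_common(1)[0]
    return value, count / len(reported)


def triage(claims, threshold=1.0):
    """Split claims into values ready for upload and claims needing curation.

    claims: {(chemical, id_type): {source: value or None}}
    threshold: assumed cut-off between "high" and "low" ranking.
    """
    upload, curate = {}, {}
    for key, values_by_source in claims.items():
        value, mark = evaluation_mark(values_by_source)
        if value is not None and mark >= threshold:
            upload[key] = value
        else:
            curate[key] = values_by_source
    return upload, curate
```

Note that a value reported by only one database gets a mark of 1.0 under this definition. Whether such single-source values count as high ranking is a decision the sketch leaves open.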

Resources

Contributors ready to check data from the Wikipedias and to identify clearly the chemical described by a Wikidata item
Persons with programming skills to handle data from databases and to create a super database
Bot operators ready to import data from the super database into Wikidata

A critical criterion for the programming is to have routines that can be rerun frequently in the future: by curating data we should be able to provide, for each database, a list of the entries needing correction.
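
As an illustration of such a repeatable routine, here is a minimal Python sketch that turns curated values into one correction list per database. The data layout (`claims` keyed by chemical and ID type, `curated` holding the accepted value) and the function name are assumptions made for this example.

```python
def correction_lists(claims, curated):
    """Build {source: [(chemical, id_type, reported value, curated value), ...]}.

    claims:  {(chemical, id_type): {source: value or None}}
    curated: {(chemical, id_type): accepted value}
    """
    report = {}
    for key, values_by_source in claims.items():
        accepted = curated.get(key)
        if accepted is None:
            continue  # not curated yet, so nothing can be reported
        chemical, id_type = key
        for source, value in values_by_source.items():
            # A missing value is a gap, not an error, so it is not listed.
            if value is not None and value != accepted:
                report.setdefault(source, []).append(
                    (chemical, id_type, value, accepted)
                )
    return report
```

Because the function only reads its two inputs, it can be rerun after every new download or curation round, and the resulting lists can be sent to the maintainers of each database.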