Wikidata:WikiProject Chemistry/ChemID



ChemID Initiative



Goals

The ChemID Initiative aims to compare the free databases on chemicals, match their corresponding identifiers (IDs), and use Wikidata as a hub connecting the data sets of those databases.

Introduction

Different databases list chemicals according to different criteria, and each identifies the chemicals it contains by an internal identifier. These identifiers are often reused outside their original database to make it easier to identify the same chemical across databases.

Most databases integrate some identifiers from other databases in order to link the data sets spread across them, but this is not done systematically. For example, database A adds the identifiers of databases B and C, database B integrates the identifiers of databases C and D, database C adds the identifiers of databases E, F and G, and so on.

Wikidata can serve as a central point of connection between the databases and the data they store by listing, in the item for each chemical, all of its database identifiers.

How to achieve this goal?

Most databases are free and open data (although we have to check each licence to see to what extent we can import the IDs) and even offer access to their data in a standard machine-readable format (XML, SDF, ...).

For each chemical, these data sets contain various data: the internal identifier, identifiers from other databases, chemical formula, InChI, InChIKey, SMILES, CAS number, IUPAC name, and so on. All of these data can be used to match the data sets of the databases and to find the entries that correspond to a single chemical.
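As a minimal sketch of this matching, the snippet below groups records from several databases by InChIKey. The database names (`db_a`, `db_b`), the internal IDs and the record layout (`id` and `inchikey` keys) are hypothetical placeholders, not the format of any real database.

```python
from collections import defaultdict


def group_by_inchikey(datasets):
    """Group internal IDs from several databases by InChIKey.

    `datasets` maps a database name to a list of records; each record is a
    dict with the keys "id" and "inchikey" (hypothetical layout).
    Returns {inchikey: {database name: [internal IDs]}}.
    """
    merged = defaultdict(dict)
    for db_name, records in datasets.items():
        for record in records:
            inchikey = record.get("inchikey")
            if not inchikey:
                # No structure-based key: such records need another match path.
                continue
            key = inchikey.strip().upper()  # tolerate stray spaces and case
            merged[key].setdefault(db_name, []).append(record["id"])
    return dict(merged)


# Hypothetical input: two databases describing water and ethanol.
datasets = {
    "db_a": [{"id": "A-1", "inchikey": "XLYOFNOQVPJJNP-UHFFFAOYSA-N"}],
    "db_b": [
        {"id": "B-9", "inchikey": "xlyofnoqvpjjnp-uhfffaoysa-n "},
        {"id": "B-10", "inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"},
    ],
}
matched = group_by_inchikey(datasets)
```

Each key of `matched` then stands for one chemical, and its value lists the internal IDs that each database uses for it.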

To avoid copyright problems, the data extraction and processing should be done outside Wikidata. Only the final result should be loaded into Wikidata (WD), once the comparison is complete and in accordance with the licence of each database.

Process definition

  1. First step: Create a list of the Q numbers of the Wikidata items that define chemicals, together with one identifier. The best compromise is to use InChIKey/InChI as the primary ID because these IDs are derived from the chemical structure.
    • This query extracts all items defined as instance of chemical compound AND having an InChIKey statement:
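    # Items that are instance of (P31) chemical compound (Q11173)
    # and have an InChIKey (P235) statement
    SELECT ?compound ?inchikey WHERE {
      ?compound wdt:P31 wd:Q11173 ;
                wdt:P235 ?inchikey .
    }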
    
    Be careful: this list contains some groups of chemicals defined as instance of chemical substance instead of subclass of chemical substance. This should be corrected later, during the ID comparison.
    Be careful: chemicals that have an InChIKey statement but no instance of chemical compound statement are not in the list.
    Status: curation is ongoing to remove duplicates (several items having the same InChIKey) and bad definitions (one item having several InChIKeys). See the items requiring curation here.
  2. Second step: Download the IDs from the databases cited above and store all these data in a single, uniform format.
    Status: Open
  3. Third step: For each chemical, compare the IDs extracted from the databases. Each ID of each chemical must be given an evaluation mark indicating whether the ID is the same in the different databases.
    Status: Open
  4. Fourth step: IDs with a low evaluation mark should be curated using information from the external databases. The goal is to understand why different databases give different values for one ID of the same chemical.
    Status: Open
  5. Fifth step: IDs with a high evaluation mark, as well as curated IDs, can be uploaded into WD.
    Status: Open
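The curation check of the first step and the evaluation mark of the third to fifth steps could be sketched as follows. The page does not define how the mark is computed, so the scoring rule (share of databases agreeing with the most common value) and the 0.75 threshold are assumptions made for illustration only. The database names, the `Q_A`-style item placeholders and the deliberately wrong dummy value `00-00-0` are hypothetical.

```python
from collections import Counter, defaultdict


def find_conflicts(rows):
    """From (item, inchikey) pairs, return InChIKeys shared by several items
    and items carrying several InChIKeys (both need curation)."""
    items_by_key = defaultdict(set)
    keys_by_item = defaultdict(set)
    for item, inchikey in rows:
        items_by_key[inchikey].add(item)
        keys_by_item[item].add(inchikey)
    shared = {k: sorted(v) for k, v in items_by_key.items() if len(v) > 1}
    multi = {i: sorted(v) for i, v in keys_by_item.items() if len(v) > 1}
    return shared, multi


def agreement_mark(values_by_db):
    """Return (most common value, share of reporting databases that give it).

    Assumed scoring rule; databases reporting None are ignored.
    """
    reported = [v for v in values_by_db.values() if v is not None]
    if not reported:
        return None, 0.0
    value, count = Counter(reported).most_common(1)[0]
    return value, count / len(reported)


def triage(claims, threshold=0.75):
    """Split claims into values ready for upload and claims needing curation.

    `claims` maps (inchikey, id_type) to {database name: value}.
    """
    upload, curate = {}, {}
    for key, values_by_db in claims.items():
        value, mark = agreement_mark(values_by_db)
        if value is not None and mark >= threshold:
            upload[key] = value
        else:
            curate[key] = values_by_db
    return upload, curate


WATER = "XLYOFNOQVPJJNP-UHFFFAOYSA-N"
ETHANOL = "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"
BENZENE = "UHOVQNZJYSORNB-UHFFFAOYSA-N"

# Hypothetical query result: two items share one key, one item has two keys.
rows = [("Q_A", WATER), ("Q_B", WATER), ("Q_C", ETHANOL), ("Q_C", BENZENE)]
shared, multi = find_conflicts(rows)

# Hypothetical CAS number claims; "00-00-0" is a deliberately wrong dummy.
claims = {
    (WATER, "cas"): {"db_a": "7732-18-5", "db_b": "7732-18-5", "db_c": None},
    (ETHANOL, "cas"): {"db_a": "64-17-5", "db_b": "64-17-5", "db_c": "00-00-0"},
}
upload, curate = triage(claims)
```

With this assumed threshold, the water claim goes to `upload` because every reporting database agrees, while the ethanol claim lands in `curate` because one of three databases disagrees.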

Resources

  • Contributors ready to check data from the Wikipedias and to identify clearly the chemical described by a Wikidata item
  • People with programming skills to handle the data from the databases and to create a super database
  • Bot operators ready to import data from the super database into Wikidata

A critical criterion for the programming is that the routines can be rerun frequently in the future: by curating data, we should be able to provide, for each database, a list of the entries that need correction.
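A repeatable routine of this kind might look like the sketch below, which compares what each database reports against the curated values and returns a correction list per database. The data layout, the database names and the dummy value `00-00-0` are hypothetical; a real routine would read them from the stored data sets.

```python
def corrections_per_database(claims, curated):
    """List, per database, the entries whose value differs from the curated one.

    `claims` maps (inchikey, id_type) to {database name: reported value};
    `curated` maps (inchikey, id_type) to the value accepted after curation.
    Both layouts are assumptions made for this sketch.
    """
    report = {}
    for key, values_by_db in claims.items():
        if key not in curated:
            continue  # not curated yet, nothing to report
        expected = curated[key]
        inchikey, id_type = key
        for db_name, value in values_by_db.items():
            if value is not None and value != expected:
                report.setdefault(db_name, []).append(
                    {
                        "inchikey": inchikey,
                        "id_type": id_type,
                        "found": value,
                        "expected": expected,
                    }
                )
    return report


# Hypothetical example: db_c reports a wrong dummy CAS number for ethanol.
ETHANOL_KEY = "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"
example_claims = {
    (ETHANOL_KEY, "cas"): {"db_a": "64-17-5", "db_b": None, "db_c": "00-00-0"},
}
example_curated = {(ETHANOL_KEY, "cas"): "64-17-5"}
report = corrections_per_database(example_claims, example_curated)
```

Because the function only reads its two inputs, it can be rerun unchanged each time the databases are downloaded again.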