Wikidata:Property proposal/Natural Product Atlas ID

Natural Product Atlas ID

Originally proposed at Wikidata:Property proposal/Natural science

Description: A link from Wikidata entries to the external chemical curation database, the Natural Product Atlas
Data type: External identifier
Domain: property
Example 1: sirolimus (Q32089) → NPA000414
Example 2: geldanamycin (Q904475) → NPA019914
Example 3: Difficidin (Q58371294) → NPA019912
Example 4: nystatin A1 (Q27292191) → NPA020315
Example 5: kedarcidin (Q15426249) → NPA020328
Planned use: For all NPAtlas compounds that have existing Wikidata entries (as determined by NPAtlas -> PubChem SID -> PubChem CID -> Wikidata entity links), I will update the entries with an NPAtlas tag. Compounds with this tag are also known to be natural products (natural product (Q901227)) and to have a producing organism (this taxon is source of (P1672)).
Number of IDs in source: 25000
Expected completeness: eventually complete (Q21873974)
Formatter URL: https://www.npatlas.org/joomla/index.php/explore/compounds#npaid=$1
Robot and gadget jobs: can be created
See also: this taxon is source of (P1672), natural product (Q901227)
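
As an illustration of how the formatter URL is meant to be used, here is a minimal sketch in Python that substitutes an identifier for the `$1` placeholder. The ID pattern (`NPA` followed by six digits) is an assumption inferred from the five examples above, not a format documented by NPAtlas, and the function name `npatlas_url` is hypothetical.

```python
import re

# Formatter URL as given in the proposal; "$1" is the placeholder for the ID.
FORMATTER_URL = "https://www.npatlas.org/joomla/index.php/explore/compounds#npaid=$1"

# Assumption: IDs look like "NPA" + six digits, inferred from the examples
# in the proposal (e.g. NPA000414). This is not an official specification.
NPA_ID_PATTERN = re.compile(r"NPA\d{6}")


def npatlas_url(npa_id: str) -> str:
    """Expand the formatter URL for one NPAtlas ID (hypothetical helper)."""
    if not NPA_ID_PATTERN.fullmatch(npa_id):
        raise ValueError(f"unexpected NPAtlas ID format: {npa_id!r}")
    return FORMATTER_URL.replace("$1", npa_id)
```

For example, `npatlas_url("NPA000414")` would produce the link for sirolimus, while a malformed value such as `"414"` is rejected rather than silently turned into a broken link.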

Motivation

The Natural Product Atlas (Natural Product Atlas (Q78224032)) is a curatorial effort that has annotated information on ~25000 small molecules produced by organisms in nature. The data has been released as a CC-attribution resource that is downloadable from their website. Many of these compounds exist in Wikidata and can be linked via a UID: either a PubChem CID or an InChIKey (or both). Furthermore, each NPAtlas compound is registered in PubChem with a Substance ID. Therefore, I believe this is a useful tag that will facilitate denser linking of compounds to their producers in nature.

While I believe the addition of this tag is useful and desirable, there are a few challenges in the broader task of incorporating natural product data into Wikidata that should be addressed.

1. Groups of related compounds. It is common to have a series of related compounds that are mostly similar to one another. For example, NPAtlas contains a number of entries that are minor variants of one another, such as Spumigin A through Spumigin F. We may wish to link only to one member of the series.

2. What is the best way to link a compound to an NPAtlas ID? Names can be ambiguous, but even unique identifiers like InChIKeys can get you into trouble. For example, the Wikidata entry for Verruculogen, verruculogen (undef. stereochem.) (Q11954479), links to PubChem CID 104862, while the NPAtlas SID for Verruculogen, 386992827, links to PubChem CID 13887805. These two PubChem-validated compound IDs refer to the same compound, but only one has assigned stereochemistry. In this case, a naive script that checks Wikidata for the existence of the named entity "Verruculogen" would find an entity, but that entity would have a conflicting InChIKey; a search for matching InChIKeys (or the linked PubChem CID) would indicate that the compound is not in Wikidata.

If we take into account issues 1) and 2), the safest way to assign NPAtlas IDs is to apply them only to the subset of current Wikidata entities that have matching names, PubChem CIDs and InChIKeys. This subset will be considerably smaller than the current full set of ~25k compounds.
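
The conservative rule above could be sketched roughly as follows. This is a minimal illustration, not the script actually used: the `CompoundRecord` type, its field names, and the `"assign"` / `"review"` / `"skip"` labels are all hypothetical. It also assumes that the first hyphen-separated block of a standard InChIKey encodes the molecular skeleton, so two keys that share that block but differ afterwards (as in the Verruculogen case) are flagged for manual review instead of being treated as unrelated.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CompoundRecord:
    """Hypothetical minimal record; field names are illustrative only."""
    name: str
    pubchem_cid: Optional[int]
    inchikey: Optional[str]


def connectivity_block(inchikey: str) -> str:
    """Return the first hyphen-separated block of an InChIKey."""
    return inchikey.split("-")[0]


def classify(npatlas: CompoundRecord, wikidata: CompoundRecord) -> str:
    """Decide whether an NPAtlas ID may be added to a Wikidata entity.

    "assign": name, PubChem CID and InChIKey all agree (the safe subset).
    "review": some identifiers agree but not all, e.g. same name or same
              skeleton with different stereochemistry; needs a human.
    "skip":   nothing agrees.
    """
    same_name = npatlas.name.casefold() == wikidata.name.casefold()
    same_cid = (
        npatlas.pubchem_cid is not None
        and npatlas.pubchem_cid == wikidata.pubchem_cid
    )
    both_keys = npatlas.inchikey is not None and wikidata.inchikey is not None
    same_key = both_keys and npatlas.inchikey == wikidata.inchikey
    same_skeleton = both_keys and (
        connectivity_block(npatlas.inchikey)
        == connectivity_block(wikidata.inchikey)
    )

    if same_name and same_cid and same_key:
        return "assign"
    if same_name or same_cid or same_skeleton:
        return "review"
    return "skip"
```

With this rule, a pair like the two Verruculogen records (same name, CIDs 13887805 and 104862, differing InChIKeys) would come out as `"review"` rather than being either linked automatically or reported as missing from Wikidata.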


Discussion

  Notified participants of WikiProject Chemistry

@ديفيد عادل وهبة خليل 2, Zcp3000, Wostr, YULdigitalpreservation, Egon Willighagen: Done: Natural Product Atlas ID (P7746). − Pintoch (talk) 17:53, 30 December 2019 (UTC)