Wikidata:Property proposal/Natural Product Atlas ID
Natural Product Atlas ID
Originally proposed at Wikidata:Property proposal/Natural science
|Description||A link from Wikidata entries to the external chemical curation database, the Natural Product Atlas|
|Data type||External identifier|
|Example 1||sirolimus (Q32089) → NPA000414|
|Example 2||Geldanamycin (Q904475) → NPA019914|
|Example 3||Difficidin (Q58371294) → NPA019912|
|Example 4||nystatin A1 (Q27292191) → NPA020315|
|Example 5||Kedarcidin (Q15426249) → NPA020328|
|Planned use||For all NPAtlas compounds that have existing Wikidata entries (as determined by NPAtlas -> PubChem SID -> PubChem CID -> WD entity links), I will update the entries with an NPAtlas ID. Compounds with this ID are also known to be natural products (natural product (Q901227)) and to have a producing organism (this taxon is source of (P1672)).|
|Number of IDs in source||25000|
|Expected completeness||eventually complete (Q21873974)|
|Robot and gadget jobs||can be created|
|See also||this taxon is source of (P1672) natural product (Q901227)|
The Natural Product Atlas (The Natural Products Atlas (Q78224032)) is a curatorial effort that has annotated information on ~25000 small molecules produced by organisms in nature. The data has been released as a CC-attribution resource that is downloadable from their website. Many of these compounds exist in Wikidata and can be linked via a unique identifier: a PubChem CID, an InChIKey, or both. Furthermore, each NPAtlas compound is registered in PubChem with a Substance ID. I therefore believe this is a useful identifier that will facilitate denser linking of compounds to their producers in nature.
While I believe the addition of this identifier is useful and desirable, there are a few challenges in the broader task of incorporating natural product data into Wikidata that should be addressed.
1. Groups of related compounds. It is common to have a series of related compounds that are mostly similar to one another. For example, NPAtlas contains a number of entries with minor variants - say Spumigin A - Spumigin F. We may wish to link only to one member of the series.
2. What is the best way to link a compound to an NPAtlas ID? Names can be ambiguous, but even unique identifiers like InChIKeys can get you into trouble. For example, the Wikidata entry for Verruculogen, 10,10a-dihydroxy-7-methoxy-2,2-dimethyl-5-(2-methyl-1-propen-1-yl)-1,10,10a,14,14a,15b-hexahydro-12H-3,4-dioxa-5a,11a,15a-triazacycloocta[1,2,3-lm]indeno[5,6-b]fluorene-11,15(2H,13H)-dione (Q11954479), links to PubChem CID 104862, while the NPAtlas SID for Verruculogen, 386992827, links to PubChem CID 13887805. These two PubChem compound IDs refer to the same compound, but only one has assigned stereochemistry. In this case, a naive script that checks Wikidata for the existence of the named entity "Verruculogen" would find an entity, but that entity would have a conflicting InChIKey; a search for matching InChIKeys (or the linked PubChem CID) would indicate that the compound is not in Wikidata.
If we take issues 1) and 2) into account, the safest way to assign NPAtlas IDs is to apply them only to the subset of current Wikidata entities that have matching names, PubChem CIDs and InChIKeys. This subset will be considerably smaller than the current full set of ~25k compounds.
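The conservative matching rule described above can be sketched as follows. This is only an illustration under stated assumptions: the `Record` type, the `classify` function, and all identifiers and InChIKeys in the example data are hypothetical placeholders, not real NPAtlas or Wikidata values. It relies on the standard InChIKey layout, in which the first 14-character block hashes the connectivity and the second block covers stereochemistry and other layers, so two keys that share only the first block are flagged for manual review (the Verruculogen situation) rather than linked automatically. Name matching is deliberately left out of the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    """One compound as seen from either side (hypothetical structure)."""
    ident: str                  # NPAtlas ID or Wikidata QID
    pubchem_cid: Optional[str]  # PubChem CID, if known
    inchikey: Optional[str]     # full standard InChIKey, if known


def skeleton(inchikey: str) -> str:
    """Return the first InChIKey block (connectivity hash)."""
    return inchikey.split("-")[0]


def classify(npa: Record, wd: Record) -> str:
    """Classify a candidate NPAtlas/Wikidata pair.

    'exact'  : PubChem CID and full InChIKey both agree; safe to link.
    'review' : same connectivity block but CID or remaining layers differ,
               e.g. one side lacks stereochemistry; needs a human check.
    'none'   : no usable evidence that the two records match.
    """
    if not npa.inchikey or not wd.inchikey:
        return "none"
    same_cid = npa.pubchem_cid is not None and npa.pubchem_cid == wd.pubchem_cid
    if same_cid and npa.inchikey == wd.inchikey:
        return "exact"
    if skeleton(npa.inchikey) == skeleton(wd.inchikey):
        return "review"
    return "none"


# Placeholder data: these keys and IDs are made up for illustration only.
npa_entry = Record("NPA-EXAMPLE", "222", "AAAAAAAAAAAAAA-BBBBBBBBSA-N")
wd_same = Record("Q-EXAMPLE-1", "222", "AAAAAAAAAAAAAA-BBBBBBBBSA-N")
wd_flat = Record("Q-EXAMPLE-2", "111", "AAAAAAAAAAAAAA-UHFFFAOYSA-N")
wd_other = Record("Q-EXAMPLE-3", "333", "CCCCCCCCCCCCCC-UHFFFAOYSA-N")

for wd in (wd_same, wd_flat, wd_other):
    print(wd.ident, classify(npa_entry, wd))
```

Only pairs classified as 'exact' would receive an NPAtlas ID automatically; the 'review' bucket gives a list of items where stereoisomers or under-specified structures may need separate Wikidata items first.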
- Question: are we sure we can legally import the data to WD? Data: CC-BY, WD: CC-0? Wostr (talk) 18:24, 10 December 2019 (UTC)
- Matching using labels (names) is not acceptable at all; many items already have names that differ from their PubChem equivalents (and most will have to, e.g. because labels in WD should not be capitalised whereas PubChem names are). So I would leave labels/names alone and avoid any matching on that basis. The best solution is to match compounds using more than one identifier, but if only one can be used, InChI/InChIKey would be the best option. Unfortunately, most WD items about chemical compounds have not been manually curated yet, and during mass imports of chemical compounds to WD no one took care to classify them properly into even basic classes (i.e. to distinguish ions from neutral compounds, or compounds with fully defined isomerism from compounds with undefined isomerism). That and other problems may lead to situations like the one in 10,10a-dihydroxy-7-methoxy-2,2-dimethyl-5-(2-methyl-1-propen-1-yl)-1,10,10a,14,14a,15b-hexahydro-12H-3,4-dioxa-5a,11a,15a-triazacycloocta[1,2,3-lm]indeno[5,6-b]fluorene-11,15(2H,13H)-dione (Q11954479), where some identifiers do not exactly match the item. However, matching InChI/InChIKey+PubChem CID would be the best option here. Wostr (talk) 18:44, 10 December 2019 (UTC)
- Support in general. This may be a source to populate this taxon is source of (P1672), as Zcp3000 stated above, but DOIs could also be matched to existing items in WD so that described by source (P1343) could be added. The question remains: can this database be imported to WD (different licenses)? Wostr (talk) 21:06, 10 December 2019 (UTC)
- And answer to #1: this should be 1:1 relation. If there is an entry in NPAtlas for a group of compounds, then WD item about group of compounds should link to that entry. Wostr (talk) 21:11, 10 December 2019 (UTC)
- BTW 10,10a-dihydroxy-7-methoxy-2,2-dimethyl-5-(2-methyl-1-propen-1-yl)-1,10,10a,14,14a,15b-hexahydro-12H-3,4-dioxa-5a,11a,15a-triazacycloocta[1,2,3-lm]indeno[5,6-b]fluorene-11,15(2H,13H)-dione (Q11954479) has been fixed (there is now verruculogen (Q78086174) and 5-epi-verruculogen (Q78085453), probably a few more items could be created for different stereoisomers and compounds without fully defined stereochemistry). Wostr (talk) 21:14, 10 December 2019 (UTC)
- Support David (talk) 07:08, 11 December 2019 (UTC)
- Support --Egon Willighagen (talk) 19:32, 12 December 2019 (UTC)
- Support. YULdigitalpreservation (talk) 12:12, 20 December 2019 (UTC)