Wikidata:Property proposal/Universal Spectrum Identifier

Universal Spectrum Identifier

Originally proposed at Wikidata:Property proposal/Natural science

Description: The Universal Spectrum Identifier (USI) is a compound identifier that provides an abstract path to refer to a single spectrum generated by a mass spectrometer, and potentially the ion that is thought to have produced it.
Represents: This property would link a mass spectrum to a chemical compound (Q11173). Complementary to SPLASH identifiers (SPLASH (P4964)).
Data type: External identifier
Domain: A Wikidata property. Items of type chemical compound (Q11173), chemical entity (Q43460564) or pure substance (Q578779) could bear this property.
Allowed values: Local Unique Identifier (LUI) pattern ^mzspec:.+$
Example 1: cocaine (Q41576) → mzspec:GNPS:GNPS-LIBRARY:accession:CCMSLIB00000211412
Example 2: erythromycin (Q213511) → mzspec:GNPS:GNPS-LIBRARY:accession:CCMSLIB00006685311
Example 3: glucagon (Q170617) → mzspec:PXD013100:Diabetes_iPSC_Beta12_5_05Sep14_Alder_14-08-24.mgf:index:40476:HSQGTFTSDYSKYLDSRRAQDFVQWLMNT/3
Source: https://www.psidev.info/usi and https://registry.identifiers.org/registry/mzspec
Planned use: Matching chemical compounds to MS or MSMS spectra
Formatter URL: https://metabolomics-usi.gnps2.org/dashinterface/?usi1=$1 and http://proteomecentral.proteomexchange.org/usi/?usi=$1
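
As a rough illustration of how the allowed-values pattern and the two formatter URLs fit together, the Python sketch below checks a candidate value against ^mzspec:.+$ and substitutes it for $1 in each formatter URL. The function names are hypothetical, and percent-encoding the value before substitution is an assumption made for this sketch, not something the proposal specifies.

```python
import re
from typing import List
from urllib.parse import quote

# Pattern and formatter URLs copied from the proposal fields.
USI_PATTERN = re.compile(r"^mzspec:.+$")
FORMATTER_URLS = [
    "https://metabolomics-usi.gnps2.org/dashinterface/?usi1=$1",
    "http://proteomecentral.proteomexchange.org/usi/?usi=$1",
]


def is_valid_usi(value: str) -> bool:
    """Return True if the value matches the proposed allowed-values pattern."""
    return USI_PATTERN.match(value) is not None


def resolver_urls(usi: str) -> List[str]:
    """Build one resolver URL per formatter URL (hypothetical helper)."""
    if not is_valid_usi(usi):
        raise ValueError(f"value does not match ^mzspec:.+$: {usi!r}")
    # Assumption: keep ':' and '/' readable, percent-encode anything else.
    encoded = quote(usi, safe=":/")
    return [template.replace("$1", encoded) for template in FORMATTER_URLS]


if __name__ == "__main__":
    for url in resolver_urls("mzspec:GNPS:GNPS-LIBRARY:accession:CCMSLIB00000211412"):
        print(url)
```

The pattern only checks the mzspec: prefix, so it accepts any non-empty remainder; stricter validation would need the USI specification itself.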

Motivation

Matching chemical compounds to MS or MSMS spectra. USIs are complementary to SPLASH because they are human-readable and "use a concatenated multi-part key that specifies the collection, mass spectrometry run and index information needed to locate a particular mass spectrum in a repository, defining keys that researchers can easily compose without requiring any special hashing algorithms." [1] GrndStt (talk) 16:12, 30 October 2022 (UTC)

  1. Universal Spectrum Identifier for mass spectra (Q114953382)

Discussion

Pinging

  Notified participants of WikiProject Chemistry GrndStt (talk) 16:20, 30 October 2022 (UTC)

  Support Interesting idea! Egon Willighagen (talk) 04:25, 31 October 2022 (UTC)
One small note in case you all move forward (happy to support integration with Wikidata): the Metabolomics Spectrum Resolver will be migrating from its UCSD domain name to one tied to a separate, non-university organization. Will update you all going forward. Mingxun.Wang (talk) 15:57, 1 November 2022 (UTC)

As a preliminary note, I am in favor of such a mapping. Its success will, however, depend on the quality of our mapping and the precision of the description, so here are a few points we should try to answer:

- In "planned use", "representative" MS/MS spectra are mentioned. Who will determine what is representative? How many can be representative: 1, 5, 100? The paper mentions billions of spectra; how will the import of "representative" spectra be limited? Are the ionization conditions another dimension of representativeness?

- "MS/MS" spectra does not seem correct; I think it can be any spectrum.

- I see no notes about adduct type, ionization mode, instruments, etc. I do not see USI as being like other "classical" external chemical identifiers. Most of the identifier mappings refer to the chemical itself, not an "artefact" of it. So, even if technically it is an external ID... I am not sure about the correct way to model it on WD.

- Does anyone know if curation can happen in an automatic way? Let's say, by matching the exact mass/InChIKey between WD and any retrievable info on the other side? Can we query mzspec for chemicals present in WD?

My five cents, AdrianoRutz (talk) 06:50, 31 October 2022 (UTC)

- In "planned use", "representative" MS/MS spectra are mentioned. Who will determine what is representative? How many can be representative: 1, 5, 100? The paper mentions billions of spectra; how will the import of "representative" spectra be limited? Are the ionization conditions another dimension of representativeness?
This is a tough one. At the moment, AFAIK, we don't have a way to produce a consensus spectral representation. So here "representative" was meant in the sense of "an exemplar spectrum". I understand it is a confusing adjective and will simply remove it for clarity. Regarding the numbers, 40,000-50,000 unique compounds with at least one spectrum across all public databases would be a broad estimate, and in the range of 2,000 spectra per compound for the very "popular" compounds in spectral databases. Again, a wild guess. It would be worth doing some stats on https://gnps-library.ucsd.edu/. So overall in the range of what we have for LOTUS data.
- "MS/MS" spectra does not seem correct; I think it can be any spectrum.
Agreed. Changed to MS and MSMS (let's forget MSn, MSe and all the rest for now...)
- I see no notes about adduct type, ionization mode, instruments, etc.
I guess this is just like the case of found in taxon (P703), where experimental information regarding the isolation and structural determination approach is to be found in the supporting reference; this information should be fetched from the linked spectral database. USI has a specific layer for spectral interpretation, which is specified here: https://psidev.info/proforma. It is, however, oriented towards peptide interpretation, but as one can see here, mzspec:PXD013100:Diabetes_iPSC_Beta12_5_05Sep14_Alder_14-08-24.mgf:index:40476:HSQGTFTSDYSKYLDSRRAQDFVQWLMNT/3, the charge state at least can be encoded. It would indeed be super cool to have the possibility to specify spectral information for molecules using SMILES, InChI or CXSMILES, for example for partially defined structures (such as an adduct where the counterion could not be placed).
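
To make the multi-part structure of the glucagon example above concrete, here is a minimal Python sketch that splits a USI on colons and reads the trailing "/3" as the charge state. The field names (collection, ms_run, index_type, index_value, interpretation) and the parse_usi helper are hypothetical labels chosen for this sketch, and it assumes the run name itself contains no colon; it is not the reference parser from the USI specification.

```python
from typing import NamedTuple, Optional


class ParsedUSI(NamedTuple):
    collection: str
    ms_run: str
    index_type: str
    index_value: str
    interpretation: Optional[str]  # e.g. a peptide sequence with charge
    charge: Optional[int]


def parse_usi(usi: str) -> ParsedUSI:
    """Split a USI into its parts (simplified, hypothetical helper)."""
    # maxsplit=5 keeps any colons inside the interpretation part intact.
    parts = usi.split(":", 5)
    if len(parts) < 5 or parts[0] != "mzspec":
        raise ValueError(f"not a recognisable USI: {usi!r}")
    _, collection, ms_run, index_type, index_value = parts[:5]
    interpretation = parts[5] if len(parts) == 6 else None

    # Assumption: a trailing "/<digits>" on the interpretation is the charge.
    charge = None
    if interpretation and "/" in interpretation:
        charge_text = interpretation.rpartition("/")[2]
        if charge_text.isdigit():
            charge = int(charge_text)
    return ParsedUSI(collection, ms_run, index_type, index_value,
                     interpretation, charge)


if __name__ == "__main__":
    print(parse_usi(
        "mzspec:PXD013100:Diabetes_iPSC_Beta12_5_05Sep14_Alder_14-08-24.mgf"
        ":index:40476:HSQGTFTSDYSKYLDSRRAQDFVQWLMNT/3"
    ))
```

For the GNPS library examples, which carry no interpretation part, the same function returns None for both interpretation and charge.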
- I do not see USI as being like other "classical" external chemical identifiers. Most of the identifier mappings refer to the chemical itself, not an "artefact" of it. So, even if technically it is an external ID... I am not sure about the correct way to model it on WD.
Me neither. However, as you say, it is an external ID. Likewise, PDB structure ID (P638) is not the structure itself but structure + protein, yet it sits in the external ID section (see caffeine (Q60235)). In fact, if we think about an InChIKey (P235), it doesn't refer to the "chemical itself" either. It's a hash of an InChI (P234), which is the encoding of a structure coming from the interpretation of spectra.
So I thought the USI should be a property and appear in the Identifiers section of a chemical, but this might indeed not be the best way. How, if at all, should this be done then?
- Does anyone know if curation can happen in an automatic way? Let's say, by matching the exact mass/InChIKey between WD and any retrievable info on the other side? Can we query mzspec for chemicals present in WD?
At https://gnps-library.ucsd.edu/ we can at least get the InChIKey, so matching with WD compounds should be straightforward. However, I don't see SPLASH or USI in the downloadable tables. I will have a look around and otherwise check with Ming Wang; he should have a way around it! For the last point regarding direct querying of USI, there might be something to do with the spectral interpretation layer (see point 3 above), but I don't think it's implemented for small molecules at the moment... GrndStt (talk) 08:56, 31 October 2022 (UTC)
@AdrianoRutz, Mingxun.Wang:, would you like to give your opinion? Regards, ZI Jony (Talk) 06:25, 25 January 2024 (UTC)
Sorry for the late reply. @GrndStt addressed most of my concerns and having this property will be useful.
  Support AdrianoRutz (talk) 07:31, 8 February 2024 (UTC)
  Neutral. I agree with Adriano that the number of IDs (= spectra) for a compound is huge. This should be avoided in WD at all costs for technical reasons. However, the issuing authority (ProteomeXchange?) should have some means to list all IDs for a given InChIKey and give a permanent URL for that list, and WD should have a means to link to that URL. This is the only way I see. --SCIdude (talk) 08:35, 31 October 2022 (UTC)
On the GNPS side, we can build a mechanism to return a list of USIs for a given InChIKey query. Then for each compound, you could link out to the MS/MS instances. Happy to provide as many data dumps as needed to analyse the coverage and scale of multiple spectra per compound. Mingxun.Wang (talk) 15:54, 1 November 2022 (UTC)
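
The InChIKey-based matching discussed in this thread could look roughly like the Python sketch below. It assumes two inputs that do not exist in this form today: a mapping from InChIKey to Wikidata QID, and rows from a GNPS library dump with "inchikey" and "accession" columns (both column names are hypothetical). The USI is composed following the pattern of Examples 1 and 2, and a per-item cap (an assumption of this sketch) limits how many spectra are attached to one compound.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping


def usis_by_item(
    wikidata_items: Mapping[str, str],          # InChIKey -> QID
    library_rows: Iterable[Mapping[str, str]],  # rows of a hypothetical dump
    max_per_item: int = 5,                      # assumed cap, not from the proposal
) -> Dict[str, List[str]]:
    """Group GNPS library USIs by the Wikidata item sharing their InChIKey."""
    matches: Dict[str, List[str]] = defaultdict(list)
    for row in library_rows:
        qid = wikidata_items.get(row["inchikey"])
        # Skip spectra with no matching item, or items already at the cap.
        if qid is None or len(matches[qid]) >= max_per_item:
            continue
        matches[qid].append(
            f"mzspec:GNPS:GNPS-LIBRARY:accession:{row['accession']}"
        )
    return dict(matches)


if __name__ == "__main__":
    # The InChIKeys below are placeholders, not real keys.
    items = {"PLACEHOLDER-KEY-1": "Q41576", "PLACEHOLDER-KEY-2": "Q213511"}
    rows = [
        {"inchikey": "PLACEHOLDER-KEY-1", "accession": "CCMSLIB00000211412"},
        {"inchikey": "PLACEHOLDER-KEY-2", "accession": "CCMSLIB00006685311"},
        {"inchikey": "PLACEHOLDER-KEY-3", "accession": "PLACEHOLDER"},
    ]
    print(usis_by_item(items, rows))
```

Lowering max_per_item is one way to explore the concern raised above about the number of spectra per compound before any import takes place.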