User:SCIdude/Modeling

This page describes the data models that exist in the molecular biology part of Wikidata, focusing on the most canonical one for each class. The canonical model in turn defines what counts as a bug and which statements are missing on items of that class. The most detailed database for a class should also map to the canonical model. Databases of subsets may be more detailed, but they would define a different class, a subclass of the more general one (e.g. transporter family and peptidase family as subclasses of enzyme family).

Enzymes

Enzymatic activity

Enzyme family

  • searching for items that are instance of enzyme but have no UniProt, RefSeq, encoded-by, or taxon statement caught >4200 enzyme families with en-WP+ sitelinks; these were made instance of enzyme family with the description "class of enzymes". There are now 4865 enzyme families.
  • removed the statement "found in taxon [P703] Homo sapiens" from 64 families
  • next is to give them all a molecular function statement, so that they can be grouped in their own hierarchy. After assigning functions whose wording was identical to the family name + "activity", 3354/4863 families had a function.
  • of the >1500 families without a function, 1502 had an EC number. There are 3 cases:
    • the EC appears once in GO: the function is found
    • the EC appears more than once in GO: the function is the parent of all others in the list (CAVEAT: inconsistencies in GO, issues on GitHub)
    • the EC is not in GO (probably obsolete in GO): try the next higher EC (e.g. 4.3.2 instead of 4.3.2.1)
  • found >120 protein families that were subclass of enzyme, all of them enzyme families. There are now 4982 enzyme families.
  • list all items and relevant fields:
 SELECT ?p ?ec ?mr ?q
 WHERE
 {
   ?p wdt:P31 wd:Q67015883 .
   OPTIONAL { ?p p:P591 ?stmt .
     ?stmt ps:P591 ?ec .
     OPTIONAL { ?stmt pq:P4390 ?mr . }
   }
   OPTIONAL { ?p wdt:P279 ?q . }
 }
  • to build the subclass tree we need to know which families correspond exactly to the higher EC nodes (1, 1.2, and 1.2.3); this has to be determined manually
  • the list of higher EC nodes where we don't have a family item:
 SELECT ?p ?ec ?pLabel
 WHERE
 {
   ?p wdt:P31 wd:Q14860489 .
   ?p p:P591 ?stmt .
   ?stmt ps:P591 ?ec .
   ?stmt pq:P4390 wd:Q39893449 .
   FILTER ( STRENDS(STR(?ec), '.-') ).
   MINUS { 
     ?q wdt:P31 wd:Q67015883 .
     ?q wdt:P680 ?p .
   }
   SERVICE wikibase:label {
     bd:serviceParam wikibase:language "en" . }
 }
  • somehow items without an EC had not been counted. There are now 5045 enzyme families.
  • the tree of intermediate family nodes that have an exact EC:

Root

├── 1 oxidoreductase (Q407479)
│   ├── 1 alcohol oxidoreductase (Q4713306)
│   └── 12 hydrogenase (Q424135)
├── 2 transferase (Q407355)
│   ├── 1 
│   │   ├── 1 methyltransferase (Q415875)
│   │   └── 4 amidinotransferase (Q68688747)
│   ├── 3 acyltransferases (Q2609152)
│   ├── 4 glycosyltransferases (Q67201373)
│   │   └── 1 hexosyltransferase (Q5749058)
│   ├── 6 
│   │   └── 1 transaminase (Q424288)
│   ├── 7 
│   │   └── 6 Diphosphotransferase (Q5279763)
│   └── 8 
│       ├── 2 sulfotransferase (Q175950)
│       └── 3 CoA-transferase (Q68689639)
├── 3 hydrolase (Q96286)
│   ├── 1 esterase (Q418750)
│   │   ├── 1 Carboxylesterase (Q409840)
│   │   ├── 3 phosphatase (Q422476)
│   │   ├── 4 phosphoric diester hydrolase (Q67202883)
│   │   └── 2 thioesterase, subgroup (Q7784664)
│   ├── 2 glycosidase (Q13527914)
│   │   └── 1 Glycoside hydrolase superfamily (Q375795)
│   ├── 4 peptidase (Q212410)
│   │   ├── 22 cysteine protease (Q419343)
│   │   ├── 11 Aminopeptidase (Q419527)
│   │   ├── 21 serine endopeptidase (Q420032)
│   │   ├── 24 metalloendopeptidase (Q6822865)
│   │   ├── 17 Metalloexopeptidase (Q6822868)
│   │   └── 25 Threonine protease (Q7798075)
│   ├── 5 
│   │   ├── 1 amidohydrolases (Q4746164)
│   │   └── 2 amidohydrolases (Q4746164)
│   └── 6 
│       └── 4 helicase (Q138864)
├── 4 lyase (Q407727)
│   ├── 1 
│   │   └── 1 carboxy-lyases (Q417781)
│   └── 2 
│       └── 1 hydro-lyase (Q16915067)
├── 5 isomerase (Q118026)
│   └── 2 Cis-trans isomerase (Q5122112)
├── 6 ligases (Q410221)
└── 7 transport protein (Q2449730)
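
A tree in this layout can be produced from a flat list of (EC prefix, label) pairs. The following is a minimal sketch and not the script used for this page; `build_tree` and `render` are hypothetical names, and the sample pairs are copied from the tree above. Intermediate nodes without a family item are created on the fly and left unlabelled.

```python
from typing import List, Tuple


def build_tree(nodes: List[Tuple[str, str]]) -> dict:
    """Nest (EC prefix, label) pairs into a dict tree keyed by EC component."""
    root = {"label": "Root", "children": {}}
    for ec, label in nodes:
        node = root
        for part in ec.split("."):
            # Create missing intermediate nodes with an empty label.
            node = node["children"].setdefault(part, {"label": "", "children": {}})
        node["label"] = label
    return root


def render(node: dict, prefix: str = "") -> List[str]:
    """Return the tree below `node` as lines using box-drawing connectors."""
    lines = []
    items = sorted(node["children"].items(), key=lambda kv: int(kv[0]))
    for i, (part, child) in enumerate(items):
        last = i == len(items) - 1
        connector = "└── " if last else "├── "
        lines.append(f"{prefix}{connector}{part} {child['label']}".rstrip())
        lines.extend(render(child, prefix + ("    " if last else "│   ")))
    return lines


sample = [
    ("2", "transferase (Q407355)"),
    ("2.1.1", "methyltransferase (Q415875)"),
    ("2.1.4", "amidinotransferase (Q68688747)"),
    ("2.3", "acyltransferases (Q2609152)"),
    ("7", "transport protein (Q2449730)"),
]
tree_lines = render(build_tree(sample))
print("\n".join(tree_lines))
```

Unlike the tree above, this sketch sorts children numerically by EC component.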
  • Script used to create hierarchy: https://gist.github.com/rwst/84b43461de6105dc4e0eda3bd1e0bd1c
  • from the queries in Protein_bugs#Stubs_from_early_days we collected more. There are now 5,197 enzyme families.
  • IPR family GO annotations, name comparisons, and manual inspection led to marking InterPro protein families as enzyme families. There are now 8,728 enzyme families. 8,721 of them have at least one molecular function link with mapping relation type broad/exact. 5,144 of these have an EC value, most without mapping relation type.
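
The three EC cases listed earlier in this section (EC found once in GO, found several times, or not found) can be read as a small lookup procedure. The following is a minimal sketch under stated assumptions, not the code that was actually used: `pick_function`, `ancestors`, `ec_to_go` and `go_parents` are hypothetical names, the GO IDs are placeholders, "parent of all others" is interpreted as "ancestor of all others", and higher EC nodes are written as plain prefixes (4.3.2) rather than with a trailing '.-'.

```python
from typing import Dict, List, Optional, Set


def ancestors(term: str, go_parents: Dict[str, Set[str]]) -> Set[str]:
    """Return all direct and indirect parents of a GO term."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        for parent in go_parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def pick_function(ec: str,
                  ec_to_go: Dict[str, List[str]],
                  go_parents: Dict[str, Set[str]]) -> Optional[str]:
    """Pick a GO molecular function for an EC number, or None if undecidable."""
    parts = ec.split(".")
    while parts:
        terms = ec_to_go.get(".".join(parts), [])
        if len(terms) == 1:
            return terms[0]  # case 1: the EC appears once in GO
        if len(terms) > 1:
            # case 2: take the term that is an ancestor of all the others
            for cand in terms:
                others = [t for t in terms if t != cand]
                if all(cand in ancestors(t, go_parents) for t in others):
                    return cand
            return None  # inconsistent GO data, needs manual inspection
        parts.pop()  # case 3: not in GO, try the next higher EC
    return None


# Placeholder data, not real GO identifiers.
ec_to_go = {
    "1.1.1.1": ["GO:A"],
    "2.1.1.1": ["GO:B", "GO:C", "GO:D"],
    "4.3.2": ["GO:E"],
}
go_parents = {"GO:C": {"GO:B"}, "GO:D": {"GO:C"}}

print(pick_function("1.1.1.1", ec_to_go, go_parents))
print(pick_function("2.1.1.1", ec_to_go, go_parents))
print(pick_function("4.3.2.1", ec_to_go, go_parents))
```

Returning `None` in the ambiguous case keeps the GO inconsistencies mentioned in the caveat visible instead of silently picking one term.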

Membrane transporter family

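  # membrane transporter families whose TCDB ID is an exact match
  ?p wdt:P31 wd:Q67101749 .
  ?p p:P7260 ?stmt .
  ?stmt ps:P7260 ?tc .
  ?stmt pq:P4390 wd:Q39893449 .

  # human proteins with a UniProt ID that are transport proteins but have no TCDB ID
  ?p wdt:P703 wd:Q15978631 .
  ?p wdt:P352 ?u .
  ?p wdt:P31 wd:Q8054 .
  ?p wdt:P279 wd:Q2449730 .
  MINUS { 
    ?p wdt:P7260 ?tc
  }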

We BLASTed human proteins that are annotated as transporters but lack a TCDB ID against TCDB, to get their classification and to add their InterPro families as transporter families (>400 proteins).
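
Assigning a classification from such a search amounts to keeping the best TCDB hit per query protein. The following is a minimal sketch, not the procedure actually used here: it assumes BLAST tabular output (`-outfmt 6`, where columns 11 and 12 are e-value and bit score), assumes that the TC number is the last `|`-separated field of the subject ID, and uses a hypothetical e-value cutoff. `best_tcdb_hits` and the sample rows are made up for illustration.

```python
from typing import Dict, Iterable, Tuple


def best_tcdb_hits(rows: Iterable[str],
                   max_evalue: float = 1e-5) -> Dict[str, Tuple[str, float]]:
    """Map each query ID to (TC number, bit score) of its best-scoring hit."""
    best: Dict[str, Tuple[str, float]] = {}
    for row in rows:
        if not row.strip() or row.startswith("#"):
            continue  # skip blank and comment lines
        cols = row.rstrip("\n").split("\t")
        query, subject = cols[0], cols[1]
        evalue, bitscore = float(cols[10]), float(cols[11])
        if evalue > max_evalue:
            continue  # hit too weak (cutoff is an assumption)
        tc_id = subject.split("|")[-1]  # assumed subject ID layout
        if query not in best or bitscore > best[query][1]:
            best[query] = (tc_id, bitscore)
    return best


# Made-up rows in the 12-column tabular format.
rows = [
    "QUERY1\tgnl|TC-DB|SUBJ1|2.A.1.1.1\t45.0\t400\t200\t5\t1\t400\t1\t400\t1e-100\t350",
    "QUERY1\tgnl|TC-DB|SUBJ2|2.A.1.2.1\t30.0\t380\t250\t8\t1\t380\t1\t380\t1e-40\t150",
    "QUERY2\tgnl|TC-DB|SUBJ3|1.A.1.1.1\t25.0\t100\t70\t3\t1\t100\t1\t100\t0.5\t30",
]
hits = best_tcdb_hits(rows)
print(hits)
```

Queries whose only hits are above the cutoff, like QUERY2 in the sample, get no classification and would need manual inspection.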

  • there are now 659 membrane transporter families, 554 exactly corresponding to TCDB nodes
  • from the queries in Protein_bugs#Stubs_from_early_days we collected more. There are now 687 membrane transporter families.
  • Ideas
    • check all TCDB:8 proteins with transport annotation
    • if there is an enwiki article check for missing InterPro ID (P2926) links
    • TCDB overview of human transporters

Protein complexes et al

Reactants/products/cargo

GO data has the following types of mixin (in field intersection_of):

has_end_location
has_input*
has_intermediate*
has_output*
has_part
has_participant*
has_primary_input*
has_primary_input_or_output*
has_primary_output*
has_start_location
has_target_end_location
has_target_start_location

The starred ones can take ChEBI IDs as arguments.
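
To see which GO terms point to ChEBI through these relations, the `intersection_of` lines of a GO file in OBO format can be scanned. The following is a minimal sketch; `chebi_links` is a hypothetical name and the IDs in the sample stanza are placeholders, not real GO or ChEBI identifiers. It assumes the usual line shape `intersection_of: <relation> <target> ! <comment>`, where the genus line carries only a target.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def chebi_links(obo_lines: Iterable[str]) -> Dict[str, List[Tuple[str, str]]]:
    """Collect (relation, ChEBI ID) pairs from intersection_of lines, per term."""
    links: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    term_id = None
    for raw in obo_lines:
        line = raw.strip()
        if line.startswith("["):
            term_id = None  # a new stanza starts
        elif line.startswith("id: "):
            term_id = line[4:]
        elif line.startswith("intersection_of: ") and term_id:
            # Drop the trailing "! comment", then split into relation and target.
            body = line[len("intersection_of: "):].split("!")[0].split()
            if len(body) == 2 and body[1].startswith("CHEBI:"):
                links[term_id].append((body[0], body[1]))
    return dict(links)


sample = """[Term]
id: GO:EXAMPLE1
name: example transport
intersection_of: GO:EXAMPLE0 ! transport
intersection_of: has_primary_input CHEBI:EXAMPLE ! some chemical
intersection_of: has_target_end_location GO:EXAMPLE2 ! somewhere
""".splitlines()

result = chebi_links(sample)
print(result)
```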

input/output of GO processes

chemical GO complex parts

  • GO complexes can refer to ChEBI items

process descriptions (starts/ends with, has part)

Protein families associated with domain

  • Goal is to have a family item for every domain item. Which have none?
SELECT DISTINCT ?item1 ?ipr ?item1Label
{
  ?item1 wdt:P31 wd:Q898273 .
  ?item1 wdt:P2926 ?ipr .
  MINUS {
    ?item2 wdt:P31 wd:Q81505329 .
    ?item2 p:P31 ?stmt .
    ?stmt pq:P642 ?item1
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } .
}


Inhibitors, enzyme inhibitors

  • in May 2020 there are x items in WD

Request: move sitelinks gene-->protein

Simplified subset property for reasoning