User:SCIdude/Imports
TCDB
association of protein with TCDB leaf node
- source: http://www.tcdb.org/download.php, full set: http://www.tcdb.org/public/tcdb
- type of data: tcdb.org associates single proteins or groups of proteins (probably all orthologs) with TCDB leaf nodes, which have a full 5-part ID. It provides protein FASTA files of all proteins so associated; these also contain the UniProt (or rarely RefSeq) ID and the TCDB ID
- process: we will extract (TCDB, UniProt) ID tuples and create claims from them; this works because UniProt IDs are unique and are generally the identifier of choice for proteins. First from a subset (1.A = 1739 proteins), then from the full set (around 16k). Later syncing will need a more sophisticated approach.
- extract with:
sed 's/^>gnl|TC-DB|//g' tcdb-*.txt |sed '/^......[A-Z]/d'|sed 's/^\(......\s[A-Z0-9.]\+\)\s.*/\1/g' |sed '/^[A-Z]\+$/d' >tcdb.all.txt
- in fact only part of the proteins are in WD: 726/1739 (42%) for 1.A, 6035/16506 (37%) for the full set
- claims to be created:
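import csv
import sys

# uniprot-wd.tab: tab-separated file with columns 'unip' (UniProt ID) and 'item' (WD item URL)
reader = csv.DictReader(open('uniprot-wd.tab', 'r'), delimiter='\t')
unips = {}    # UniProt ID -> WD item ID
dups = set()  # UniProt IDs that map to more than one item
for item in reader:
    uid = item.get('unip')
    iturl = item.get('item')
    it = iturl[iturl.rfind('/')+1:]  # keep only the part after the last '/'
    git = unips.get(uid)
    if git is None or git == it:
        unips[uid] = it
    else:
        #print('more than one value: {} ({}, {})'.format(uid, git, it))
        dups.add(uid)
# drop ambiguous UniProt IDs
for k in dups:
    unips.pop(k)
uids = set(unips.keys())
tups = []
# stdin: lines of the form '<6-character UniProt ID> <TCDB ID>' (tcdb.all.txt)
for line in sys.stdin:
    l = line.rstrip()
    u = l[:6]
    if u in uids:
        tups.append((u, l[7:]))
print(len(tups))
# one statement per line: item|P7260|"TCDB ID"
for tup in tups:
    print('{}|P7260|"{}"'.format(unips.get(tup[0]), tup[1]))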
- batches: #18065, #18066, #18067, #18072
- categories should have mapping relation type (P4390)--exact match (Q39893449)
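The sed extraction step above can also be sketched in Python. This is only an illustration, not the command that was actually run: it assumes header lines of the form `>gnl|TC-DB|<6-character accession> <TCDB ID> <description>` (inferred from the sed patterns), and the `HEADER` pattern, the `extract_pairs` helper and the sample lines are all made up for the example. The 5-part ID pattern is stricter than the one used in the sed command.

```python
import re

# Assumed header shape, inferred from the sed pipeline:
#   >gnl|TC-DB|<6-character accession> <TCDB ID> <free-text description>
# Longer accessions and RefSeq-style IDs do not match and are skipped.
HEADER = re.compile(
    r"^>gnl\|TC-DB\|([A-Z0-9]{6})\s+(\d\.[A-Z]\.\d+\.\d+\.\d+)(?:\s|$)"
)

def extract_pairs(lines):
    """Yield (accession, tcdb_id) for each line that matches HEADER."""
    for line in lines:
        m = HEADER.match(line)
        if m:
            yield m.group(1), m.group(2)

# Made-up sample input: one usable header, one sequence line,
# and two headers with accessions that are not 6 characters long.
sample = [
    ">gnl|TC-DB|P12345 1.A.1.1.1 hypothetical channel protein",
    "MKTAYIAKQR",
    ">gnl|TC-DB|A0A0B4J2F0 1.A.1.2.3 ten-character accession",
    ">gnl|TC-DB|NP_123456 2.A.1.1.1 RefSeq-style accession",
]
pairs = list(extract_pairs(sample))
for acc, tcdb_id in pairs:
    print(acc, tcdb_id)
```

Matching whole headers with one anchored pattern skips sequence lines without a separate filter, at the cost of silently dropping any header that deviates from the assumed shape.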
Missing gene/disease associations
- Statements to create: (gene) "genetic association" (disease) --- (reference) "stated in"-->"UniProt" --- retrieved-->"2019-08-13" --- "UniProt protein ID"-->id
- Source: UniProt query
organism:"Homo sapiens (Human) [9606]"
with columns: Entry name, Gene name (primary), Protein names, Involvement in disease
- WD data needed: human UniProt <--> gene item mapping:
SELECT ?uid ?gid ?gLabel
WHERE {
  ?item wdt:P703 wd:Q15978631 .
  ?item wdt:P352 ?uid .
  ?item wdt:P702 ?gid .
  ?gid rdfs:label ?gLabel .
  FILTER( LANG(?gLabel)="en" ) .
}
- WD data needed: diseases with OMIM ID (exact match):
SELECT ?item ?itemLabel ?omim
WHERE {
  ?item p:P492 ?claim .
  ?claim ps:P492 ?omim .
  ?claim pq:P4390 wd:Q39893449 .
  MINUS { ?item wdt:P31 wd:Q7187 } .
  ?item wdt:P31/wdt:P279* wd:Q12136 .
  ?item rdfs:label ?itemLabel .
  FILTER( LANG(?itemLabel)="en" ) .
}
- Script to produce QS batches: https://gist.github.com/rwst/7e7218533eca5235419db2a878164b07
- Jobs: #17773, #17788, #17789, #17883
- Finish date: 2019-Aug-29
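For illustration, one such statement with its reference could be written as a QuickStatements (version 1) line roughly as follows. This is a sketch, not the code from the gist linked above: the `qs_line` helper and the example IDs are hypothetical, and the property and item IDs other than P352 (P2293 for "genetic association", P248 for "stated in", P813 for "retrieved", Q905695 for UniProt) are not given on this page and are assumptions to be checked before use.

```python
def qs_line(gene_qid, disease_qid, uniprot_id, retrieved="2019-08-13"):
    """Build one QuickStatements v1 line: statement plus a three-part reference.

    Assumed IDs (verify before use): P2293 genetic association,
    P248 stated in, Q905695 UniProt, P813 retrieved, P352 UniProt protein ID.
    Reference properties are written with an 'S' prefix instead of 'P'.
    """
    parts = [
        gene_qid, "P2293", disease_qid,
        "S248", "Q905695",
        # day precision is written as /11 in QuickStatements date values
        "S813", "+{}T00:00:00Z/11".format(retrieved),
        "S352", '"{}"'.format(uniprot_id),
    ]
    return "|".join(parts)

# Hypothetical gene item, disease item and UniProt ID
line = qs_line("Q1", "Q2", "P12345")
print(line)
```

The line uses `|` as separator, as in the TCDB batches above.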
missing PMID
32147628 31978293 32127517 32151274 32151334 32035511 32036774 32148173 32035018 32141588 32156329 32050059 32050060 32065221 32077441 32135586 31997390 32132521 32043982 32135587 32102621 32174267 32135585 32077440 32102777 32138488