User:SCIdude/Imports

TCDB

association of protein with TCDB leaf node

source: http://www.tcdb.org/download.php, full set: http://www.tcdb.org/public/tcdb
type of data: tcdb.org associates single proteins or groups of proteins (probably all orthologs) with TCDB leaf nodes, which have a full 5-part ID. They provide protein FASTA files of all proteins associated in this way; the headers also contain the UniProt (or, rarely, RefSeq) ID and the TCDB ID
process: we will extract (TCDB, UniProt) ID tuples and create claims from them; this works because UniProt IDs are distinct and are generally the identifier of choice for proteins. First from a subset (1.A = 1739 proteins), then from the full set (around 16k). Later syncing will need a more sophisticated approach.
extract with:

sed 's/^>gnl|TC-DB|//g' tcdb-*.txt |sed '/^......[A-Z]/d'|sed 's/^\(......\s[A-Z0-9.]\+\)\s.*/\1/g' |sed '/^[A-Z]\+$/d' >tcdb.all.txt
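Roughly the same extraction can be sketched in Python. This is only an illustration, not the pipeline actually used. It assumes, as the sed expressions imply, that each FASTA header looks like `>gnl|TC-DB|<6-character ID> <TCDB ID> <description>`; the function name `extract_pairs` is hypothetical.

```python
import re

# Assumed header layout (inferred from the sed expressions above):
#   >gnl|TC-DB|<6-char ID><whitespace><TCDB ID><whitespace><description>
HEADER = re.compile(r'^>gnl\|TC-DB\|(\S{6})\s([A-Z0-9.]+)(?:\s|$)')


def extract_pairs(lines):
    """Yield (protein_id, tcdb_id) for every header with a 6-character ID.

    Sequence lines and headers with longer IDs do not match and are skipped.
    """
    for line in lines:
        m = HEADER.match(line)
        if m:
            yield m.group(1), m.group(2)
```

Feeding it the lines of a downloaded FASTA file and printing each pair separated by a space would give the same two-column layout as tcdb.all.txt.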

only part of the proteins are in WD: 726/1739 (42%) for 1.A, 6035/16506 (37%) for all
claims to be created:
- item--Transporter Classification Database ID (P7260)--TCDB ID
  - REF stated in (P248)--Transporter Classification database (Q142667)
  - retrieved (P813)--today

from sys import stdin
import csv

# Build a UniProt ID -> Wikidata item map from the query export.
unips = {}
dups = set()
with open('uniprot-wd.tab', 'r') as f:
    for item in csv.DictReader(f, delimiter='\t'):
        uid = item.get('unip')
        iturl = item.get('item')
        it = iturl[iturl.rfind('/')+1:]  # keep only the QID from the entity URL
        git = unips.get(uid)
        if git is None or git == it:
            unips[uid] = it
        else:
            # more than one item for this UniProt ID
            dups.add(uid)
# Drop ambiguous UniProt IDs.
for k in dups:
    unips.pop(k)

# Read "UniProt-ID TCDB-ID" lines (tcdb.all.txt) from stdin and keep
# those whose UniProt ID has exactly one item.
tups = []
for line in stdin:
    l = line.rstrip()
    u = l[:6]
    if u in unips:
        tups.append((u, l[7:]))

print(len(tups))
for u, tcdb in tups:
    print('{}|P7260|"{}"'.format(unips[u], tcdb))
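The script above prints only the main statement. One possible way to also emit the reference from the list of claims to be created is sketched below. It assumes QuickStatements V1 syntax, where reference properties are written with an S prefix and dates as `+YYYY-MM-DDT00:00:00Z/11` (day precision); the helper name `qs_line` is hypothetical.

```python
from datetime import date


def qs_line(item, tcdb_id, retrieved=None):
    """Return one QuickStatements V1 line: P7260 statement plus reference.

    Reference parts: stated in (P248) = Q142667, retrieved (P813) = date.
    """
    retrieved = retrieved or date.today()
    stamp = '+{}T00:00:00Z/11'.format(retrieved.isoformat())
    fields = [item, 'P7260', '"{}"'.format(tcdb_id),
              'S248', 'Q142667',
              'S813', stamp]
    return '|'.join(fields)
```

In the final loop of the script, `print(qs_line(item, tcdb_id))` would take the place of the plain `print(...)` call, with `item` looked up in `unips` and `tcdb_id` taken from the tuple.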

batches: #18065, #18066, #18067, #18072
categories should have mapping relation type (P4390)--exact match (Q39893449)

Missing gene/disease associations

Statements to create: (gene) "genetic association" (disease) --- (reference) "stated in"-->"UniProt" --- retrieved-->"2019-08-13" --- "UniProt protein ID"-->id
Source: UniProt query organism:"Homo sapiens (Human) [9606]" with columns: Entry name, Gene name (primary), Protein names, Involvement in disease
WD data needed: human UniProt <--> gene item mapping

SELECT ?uid ?gid ?gLabel
WHERE
{
  ?item wdt:P703 wd:Q15978631 .
  ?item wdt:P352 ?uid .
  ?item wdt:P702 ?gid .
  ?gid rdfs:label ?gLabel .
  FILTER( LANG(?gLabel)="en" ) .
}
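One way to turn the result of such a query into a tab-separated file is sketched below. It assumes the standard SPARQL JSON results layout (`results.bindings`, each variable carrying a `value`), which the query service returns when JSON is requested. The function name `bindings_to_tsv` is hypothetical and not part of the original workflow.

```python
import csv
import io
import json

ENTITY_PREFIX = 'http://www.wikidata.org/entity/'


def bindings_to_tsv(result_json, columns):
    """Convert SPARQL JSON results to TSV text, shortening entity URIs to QIDs."""
    data = json.loads(result_json)
    out = io.StringIO()
    writer = csv.writer(out, delimiter='\t', lineterminator='\n')
    writer.writerow(columns)
    for binding in data['results']['bindings']:
        row = []
        for col in columns:
            # Unbound variables become empty cells.
            value = binding.get(col, {}).get('value', '')
            if value.startswith(ENTITY_PREFIX):
                value = value[len(ENTITY_PREFIX):]
            row.append(value)
        writer.writerow(row)
    return out.getvalue()
```

Called with `['uid', 'gid', 'gLabel']` as columns, this gives one row per UniProt ID and gene item pair.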

WD data needed: diseases with OMIM ID (exact match):

SELECT ?item ?itemLabel ?omim
WHERE
{
  ?item p:P492 ?claim .
  ?claim ps:P492 ?omim .
  ?claim pq:P4390 wd:Q39893449 .
  MINUS { ?item wdt:P31 wd:Q7187 } .
  ?item wdt:P31/wdt:P279* wd:Q12136 .
  ?item rdfs:label ?itemLabel .
  FILTER( LANG(?itemLabel)="en" ) .
}

Script to produce QS batches: https://gist.github.com/rwst/7e7218533eca5235419db2a878164b07
Jobs: #17773, #17788, #17789, #17883
Finish date: 2019-Aug-29
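
The join described in this section can be illustrated as follows. This is not the linked gist script, only a minimal sketch under assumptions: the "Involvement in disease" text is assumed to cite OMIM entries as `[MIM:<number>]`, and `gene_by_uniprot` and `disease_by_omim` are hypothetical dicts built from the two query results above. The function name `associations` is hypothetical as well.

```python
import re

# Assumed citation pattern for OMIM numbers in the disease column.
MIM = re.compile(r'\[MIM:(\d+)\]')


def associations(uniprot_id, disease_text, gene_by_uniprot, disease_by_omim):
    """Return (gene item, disease item, UniProt ID) triples for one protein.

    Proteins without a gene item and OMIM numbers without an exact-match
    disease item are skipped; repeated diseases are reported once.
    """
    gene = gene_by_uniprot.get(uniprot_id)
    if gene is None:
        return []
    triples = []
    seen = set()
    for omim in MIM.findall(disease_text):
        disease = disease_by_omim.get(omim)
        if disease is not None and disease not in seen:
            seen.add(disease)
            triples.append((gene, disease, uniprot_id))
    return triples
```

Each triple corresponds to one statement to create: the gene item as subject, the disease item as value, and the UniProt ID for the reference.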

missing PMID

32147628 31978293 32127517 32151274 32151334 32035511 32036774 32148173 32035018 32141588 32156329 32050059 32050060 32065221 32077441 32135586 31997390 32132521 32043982 32135587 32102621 32174267 32135585 32077440 32102777 32138488