Wikidata:ScienceSource project/ScienceSourceIngest notebook
This page for the ScienceSource project covers the use of the SSIngest tool to create content on the ScienceSource wiki.
See:
- Wikidata:ScienceSource project/ScienceSourceIngest dashboard for the table by specialty that once was here, explaining the breakdown used.
- Wikidata:ScienceSource project/Focus list additions for how the focus list, a staging area for ingestion, has been built up, using the queries found in that table.
Trials
- Sample disease query
#Diseases within sleep medicine
SELECT DISTINCT ?item
WHERE {
?item wdt:P31 wd:Q12136 .
?item wdt:P1995 ?medspec .
?medspec wdt:P361* wd:Q744029 .
SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
LIMIT 10000
- Sample drug query
#Drugs used as treatment in sleep medicine
SELECT DISTINCT ?item
WHERE {
?item wdt:P31 wd:Q12140 .
?item wdt:P2175 ?condition .
?condition wdt:P1995 wd:Q744029 .
SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
LIMIT 5000
Queries of these types can be converted into dictionaries with the aaraa tool.
- Sample ingest query
SELECT DISTINCT ?item ?itemLabel ?pmcid ?journalLabel ?title ?date ?licenseLabel ?mainsubjectLabel
WHERE {
?item wdt:P31 wd:Q7318358;
wdt:P5008 wd:Q55439927;
wdt:P932 ?pmcid;
wdt:P1433 ?journal;
wdt:P1476 ?title;
wdt:P577 ?date;
wdt:P275 ?license;
wdt:P921 ?mainsubject.
?mainsubject wdt:P1995 wd:Q744029.
MINUS {?item wdt:P275 wd:Q36795408} #Remove these ND licences
MINUS {?item wdt:P275 wd:Q6937225}
MINUS {?item wdt:P275 wd:Q19125045}
MINUS {?item wdt:P275 wd:Q24082749}
MINUS {?item wdt:P275 wd:Q6936496} #Remove NC license
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
LIMIT 300
- Sample ingest query for offset/limit batch
SELECT DISTINCT ?item ?itemLabel ?pmcid ?journalLabel ?title ?date ?licenseLabel ?mainsubjectLabel
WHERE {
?item wdt:P31 wd:Q7318358;
wdt:P5008 wd:Q55439927;
wdt:P932 ?pmcid;
wdt:P1433 ?journal;
wdt:P1476 ?title;
wdt:P577 ?date;
wdt:P275 ?license;
wdt:P921 ?mainsubject.
?mainsubject wdt:P1995 wd:Q788926.
MINUS {?item wdt:P275 wd:Q36795408} #Remove these ND licences
MINUS {?item wdt:P275 wd:Q6937225}
MINUS {?item wdt:P275 wd:Q19125045}
MINUS {?item wdt:P275 wd:Q24082749}
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
ORDER BY
DESC(?date)
OFFSET 1000
LIMIT 500
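The offset/limit pattern above lends itself to being generated rather than edited by hand. The following is a minimal sketch, assuming the batch query only varies in its specialty QID, OFFSET and LIMIT; the names `BATCH_QUERY`, `batch_windows` and `build_query` are hypothetical and are not part of SSIngest.

```python
from string import Template

# Template mirroring the sample batch query; only the specialty QID,
# OFFSET and LIMIT are substituted. "?" variables are left untouched
# because Template only reacts to "$".
BATCH_QUERY = Template("""\
SELECT DISTINCT ?item ?itemLabel ?pmcid ?journalLabel ?title ?date ?licenseLabel ?mainsubjectLabel
WHERE {
  ?item wdt:P31 wd:Q7318358;
        wdt:P5008 wd:Q55439927;
        wdt:P932 ?pmcid;
        wdt:P1433 ?journal;
        wdt:P1476 ?title;
        wdt:P577 ?date;
        wdt:P275 ?license;
        wdt:P921 ?mainsubject.
  ?mainsubject wdt:P1995 wd:$specialty.
  MINUS {?item wdt:P275 wd:Q36795408}
  MINUS {?item wdt:P275 wd:Q6937225}
  MINUS {?item wdt:P275 wd:Q19125045}
  MINUS {?item wdt:P275 wd:Q24082749}
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
ORDER BY DESC(?date)
OFFSET $offset
LIMIT $limit
""")


def batch_windows(total, batch_size):
    """Yield (offset, limit) pairs that cover `total` results in order."""
    for offset in range(0, total, batch_size):
        yield offset, min(batch_size, total - offset)


def build_query(specialty, offset, limit):
    """Fill the template for one batch of one specialty."""
    return BATCH_QUERY.substitute(specialty=specialty, offset=offset, limit=limit)


if __name__ == "__main__":
    # Q788926 is the specialty used in the sample batch query.
    for offset, limit in batch_windows(1200, 500):
        print(build_query("Q788926", offset, limit).splitlines()[-2:])
```

Because the query is ordered by `DESC(?date)`, the windows are only stable while the underlying result set does not change between batches; new items added to the focus list mid-run would shift later offsets.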
Licensing
#Filter for most common forms of CC0, CC-by and CC-by-SA licenses
SELECT DISTINCT ?paper
WHERE
{
?paper wdt:P31 wd:Q13442814;
wdt:P31 wd:Q7318358;
wdt:P921 ?subject;
wdt:P1476 ?title.
{?paper wdt:P275 wd:Q6938433 } #CC0
UNION {?paper wdt:P275 wd:Q6905323} UNION {?paper wdt:P275 wd:Q20007257}
UNION {?paper wdt:P275 wd:Q14947546} UNION {?paper wdt:P275 wd:Q18810333} UNION {?paper wdt:P275 wd:Q19125117}
##CC-by, CC-by 4.0, CC-by 3.0, CC-by 2.5, CC-by 2.0
UNION {?paper wdt:P275 wd:Q6905942} UNION {?paper wdt:P275 wd:Q18199165}
UNION {?paper wdt:P275 wd:Q14946043} UNION {?paper wdt:P275 wd:Q19113751} UNION {?paper wdt:P275 wd:Q19068220}
##CC-by-SA, CC-by-SA 4.0, CC-by-SA 3.0, CC-by-SA 2.5, CC-by-SA 2.0
?subject wdt:P1995 ?spec.
?spec wdt:P361* wd:Q788926
}
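The chain of UNION branches can also be written as a single VALUES list. The sketch below generates such a clause from a table of accepted licenses, assuming every license is matched through P275 (license); the QIDs and labels are those given in the query comments, while `ACCEPTED_LICENSES` and `license_values_clause` are hypothetical names introduced for illustration.

```python
# Accepted licenses, keyed by Wikidata QID, labelled as in the query comments.
ACCEPTED_LICENSES = {
    "Q6938433": "CC0",
    "Q6905323": "CC-by",
    "Q20007257": "CC-by 4.0",
    "Q14947546": "CC-by 3.0",
    "Q18810333": "CC-by 2.5",
    "Q19125117": "CC-by 2.0",
    "Q6905942": "CC-by-SA",
    "Q18199165": "CC-by-SA 4.0",
    "Q14946043": "CC-by-SA 3.0",
    "Q19113751": "CC-by-SA 2.5",
    "Q19068220": "CC-by-SA 2.0",
}


def license_values_clause(licenses, paper_var="?paper", license_var="?license"):
    """Return SPARQL lines restricting paper_var to the given license QIDs."""
    ids = " ".join(f"wd:{qid}" for qid in licenses)
    return (
        f"VALUES {license_var} {{ {ids} }}\n"
        f"{paper_var} wdt:P275 {license_var} ."
    )


if __name__ == "__main__":
    print(license_values_clause(ACCEPTED_LICENSES))
```

With `SELECT DISTINCT ?paper`, this form should select the same papers as the UNION version, and the list of licenses can be maintained in one place.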
Production runs
Neurology
Query as for oncology.
Date | Started (UTC) | Ended (UTC) | Papers after de-duplication | Offset limit |
---|---|---|---|---|
05-12 | 0801 | 1354 | 383 | 0 limit 500 |
05-12 | 2051 | 0901 | 711 | 500 limit 1000 |
05-13 | 0909 | 1429 | 361 | 1500 limit 500 |
05-13 | 1449 | 0258 | 702 | 2000 limit 1000 |
05-14 | 0345 | 1306 | 677 | 3000 limit 1000 |
05-14 | 1512 | 2346 | 704 | 4000 limit 1000 |
05-15 | 0540 | 1519 | 657 | 5000 limit 1000 |
05-15 | 1646 | 0109 | 642 | 6000 limit 1000, to EOF |
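Several runs in these tables end after midnight UTC, so the end time is numerically smaller than the start time. A minimal sketch for turning the HHMM columns into run lengths, assuming every run lasted less than 24 hours; `run_minutes` and `papers_per_hour` are hypothetical helper names, and the two rows used are the first two neurology runs.

```python
def run_minutes(started, ended):
    """Minutes between two HHMM UTC times, wrapping past midnight.

    Assumes the run lasted less than 24 hours.
    """
    start = int(started[:2]) * 60 + int(started[2:])
    end = int(ended[:2]) * 60 + int(ended[2:])
    return (end - start) % (24 * 60)


def papers_per_hour(started, ended, papers):
    """Average throughput of a run, in papers per hour."""
    return papers * 60 / run_minutes(started, ended)


if __name__ == "__main__":
    # (started, ended, papers after de-duplication)
    runs = [("0801", "1354", 383), ("2051", "0901", 711)]
    for started, ended, papers in runs:
        minutes = run_minutes(started, ended)
        rate = papers_per_hour(started, ended, papers)
        print(f"{started}-{ended}: {minutes} min, {rate:.1f} papers/hour")
```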
Medical genetics
Query as for oncology.
Run | Started (UTC) | Ended (UTC) | Papers after de-duplication | Offset limit |
---|---|---|---|---|
1 | 1607 | 2146 | 722 | 0 limit 722 |
2 | 1748 | ? | 289 | 1000 to EOF |
Oncology
- Baseline ingest query[1]
Date | Started (UTC) | Ended (UTC) | Papers after de-duplication | Offset limit |
---|---|---|---|---|
04-29 | 0859 | 1544 on 04-30, terminated: "cancer" as dictionary term causing too many hits | 694 | 0 limit 1000 |
05-03 | 0814 | 0306. Dictionary with "cancer" removed; trial run with 10 made first | 668 | 1010 limit 990 |
05-06 | 1804 | 0154 | 331 | 2000 limit 500 |
05-07 | 0514 | 1118, terminated by connection failure | 344 | 2500 limit 500 |
05-07 | 1518 | 2030 | 253 | 3000 limit 350 |
05-07 | 2202 | 0819 | 474 | 3350 limit 650 |
05-08 | 0855 | 0946, terminated by connection failure | 351 | 4000 limit 500 |
05-08 | 1151 | 1907 | 355 | 4500 limit 500 |
05-08 | 2120 | 0951 | 333 | 5000 limit 500 |
05-09 | 0811 | 1455 | 336 | 5500 limit 500 |
05-09 | 1531 | 0302 | 613 | 6000 limit 1000 |
05-10 | 0552 | 1425 | 707 | 7000 limit 1000 |
05-10 | 1955 | 0956 | 661 | 8000 limit 1000 |
Infectious Disease
- Baseline ingest query[2]
- Residual checker query, URL
Offset | Limit | Run started | Papers to process | Run terminated | Residual, start | Residual, end | Article items, start | Article items, end | Comments |
---|---|---|---|---|---|---|---|---|---|
0 | 500 | 2019-04-01 16:22 | 346 | 16:31, terminated | 6635 | 430 | feedinfectiousdisease0.json. Nearly all papers had a message like "Failed to process paper 6348913: Failed to upload paper: invalid character '<' looking for beginning of value"; just a few seemed to get to the "reconciling" stage. Nothing was written to the SS wiki. Rerun on 15 April: "papers found" was 346, and the run went from 0913 to 1558 with some individual papers failing. Residual query total for this specialty on 16 April was 6631. |
Records for runs starting with feedinfectiousdisease2.json.
Date | Started (UTC) | Ended (UTC) | Papers after de-duplication (default batch 500) | Offset |
---|---|---|---|---|
04-16 | 2109 | 0301 | 374 | 500 |
04-17 | 0537 | 1123 | 372 | 1000 |
04-17 | 1251 | 1827 | 364 | 1500 |
04-17 | 2130 | 0252 | 341 | 2000 |
04-18 | 0453 | 0953 | 359 | 2500 |
04-18 | 1244 | 1726 | 355 | 3000 |
04-18 | 2050 | 0127 | 384 | 3500 |
04-19 | 0514 | 1120 | 349 | 4000 |
04-19 | 1140 | 1439 | 385 | 4500 |
04-19 | 1528 | 2211 | 363 | 5000 |
04-21 | 0343 | 0945 | 344 | 5500 |
04-21 | 0954 | 1515 | 358 | 6000 |
04-21 | 1536 | 2011 | 330 | 6500 |
04-21 | 2111 | 0326 | 353 | 7000 |
04-22 | 0336 | 0753 | 309 | 7500 |
04-22 | 2002 | 355 | 355 | 8000 |
04-22 | 2117 | 0317 | 305 | 8600 |
04-23 | 0503 | 0545 | 71 (from 100) | 8500 |
04-23 | 0614 | 0657 | 63 (to EOF) | 9100 |
The nominal total of 6106, against a specialty count of 5875, implies some double counting. One possible cause is papers that failed after the de-duplication stage and were then found again in later batches.
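One way to test the double-counting explanation would be to compare the identifiers processed in each batch. The sketch below is illustrative only: it assumes a list of PMCIDs can be recovered per batch (for example from the feed files or run logs), and the identifiers shown are hypothetical placeholders, not real data from these runs.

```python
from collections import Counter


def double_counted(batches):
    """Return the IDs that appear in more than one batch, sorted."""
    # set(batch) ensures a repeat inside a single batch is not counted twice.
    counts = Counter(pmcid for batch in batches for pmcid in set(batch))
    return sorted(pmcid for pmcid, n in counts.items() if n > 1)


def totals(batches):
    """Return (nominal total, distinct total) over all batches."""
    nominal = sum(len(batch) for batch in batches)
    distinct = len({pmcid for batch in batches for pmcid in batch})
    return nominal, distinct


if __name__ == "__main__":
    # Hypothetical batches; a failed paper reappearing later shows up twice.
    batches = [["PMC1", "PMC2", "PMC3"], ["PMC3", "PMC4"], ["PMC4", "PMC5"]]
    print(totals(batches))
    print(double_counted(batches))
```

The gap between the nominal and distinct totals gives the amount of double counting, and the list of repeated identifiers shows which papers to check for earlier failures.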
Endocrinology
- Compound ingest query, which uses federation to check for papers already present on the SS wiki.[3]
- "Residual query": On 16 April, before any ingestion, there were 4347 items on the focus list that had a main subject within endocrinology or one of its subspecialties and did not match an article item on the SS wiki.
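The federated check in the compound query matches Wikidata items against SS wiki article items through a string form of the QID, which the query derives with `substr(str(?item),32,39)`. The sketch below reproduces that step, assuming item URIs use the standard `http://www.wikidata.org/entity/` prefix; `qid_from_uri` is a hypothetical helper name.

```python
WIKIDATA_ENTITY_PREFIX = "http://www.wikidata.org/entity/"

# SPARQL substr() is 1-based: start 32 with length 39 corresponds to the
# Python slice [31:70]. The prefix is 31 characters long, so the slice
# begins at the "Q" of the item identifier.
SUBSTR_START = 32
SUBSTR_LENGTH = 39


def qid_from_uri(uri):
    """Mimic substr(str(?item), 32, 39) from the compound ingest query."""
    begin = SUBSTR_START - 1
    return uri[begin:begin + SUBSTR_LENGTH]


if __name__ == "__main__":
    print(len(WIKIDATA_ENTITY_PREFIX))
    print(qid_from_uri(WIKIDATA_ENTITY_PREFIX + "Q788926"))
```

The length of 39 is far more than any current QID needs, so in practice the expression returns everything after the prefix.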
Date | Started (UTC) | Ended (UTC) | Papers after de-duplication (default batch 500) | Offset |
---|---|---|---|---|
04-16 | 0913 | 1508 | 399 | 0 |
04-16 | 0531 | 2032 | 369 | 500 |
04-16 | 2132 | 0236 | 354 | 1000 |
04-17 | 0638 | 1136 | 349 | 1500 |
04-17 | 1306 | 1841 | 353 | 2000 |
04-23 | 0837 (incorrect dictionaries used, ran through quickly) | 0851 | 309 | 2500 |
04-23 | 0918 | 1316 [Failure messages at end] | 344 | 3000 |
04-24 | 0622 | 0626 | 10 | 3990, limit 10. Test run to check for failures, ran through OK |
04-24 | 0633 | 0953 | 343 | 3500 (suffix 8) |
04-24 | 1100 | 1831 | 371 | 4000 |
04-25 | 1639 | 2254 | 385 | 4500 |
04-26 | 0511 | ? | 336 | 5000 |
04-26 | 1216 | 1846 | 468 | 5500 limit 1000 EOF |
Cardiology
Query as for oncology.
Date | Started (UTC) | Ended (UTC) | Papers after de-duplication | Offset limit |
---|---|---|---|---|
05-16 | 1521 | 0238 | 1615 | Offset 0 limit 1500 |
Gastroenterology
Query as for oncology.
Date | Started (UTC) | Ended (UTC) | Papers after de-duplication | Offset limit |
---|---|---|---|---|
05-16 | 0344 | 1112 | 542 | 0 limit 750 |
05-16 | 1143 | 1419 | 204 | 750 limit 250 |
Notes
- ↑
#Compound ScienceSource ingest query
#With checking of presence of paper on the SS wiki.
#Checks for publishers found on Beall's list (final version, as found on Wikidata)
#Removes "no derivatives" Creative Commons licenses
#Reverse chronological order used for selection.
PREFIX ss: <http://sciencesource.wmflabs.org/entity/>
PREFIX sst: <http://sciencesource.wmflabs.org/prop/direct/>
SELECT DISTINCT ?item ?itemLabel ?pmcid ?journalLabel ?title ?date ?licenseLabel ?mainsubjectLabel
WHERE {
?item wdt:P31 wd:Q7318358;
wdt:P5008 wd:Q55439927;
wdt:P932 ?pmcid;
wdt:P1433 ?journal;
wdt:P1476 ?title;
wdt:P577 ?date;
wdt:P275 ?license;
wdt:P921 ?mainsubject.
?mainsubject wdt:P1995 ?spec.
?spec wdt:P361* wd:Q162555.
MINUS {?item wdt:P275 wd:Q36795408} #Remove these ND licences
MINUS {?item wdt:P275 wd:Q6937225}
MINUS {?item wdt:P275 wd:Q19125045}
MINUS {?item wdt:P275 wd:Q24082749}
MINUS {
?item wdt:P5008 wd:Q55439927;
wdt:P1433 ?journal.
?journal wdt:P123 ?publisher.
{
VALUES ?publisher {wd:Q52636754 wd:Q52635805 wd:Q4689899 wd:Q52620137 wd:Q4732612 wd:Q43080819 wd:Q30270870 wd:Q30297686 wd:Q52661346 wd:Q52636079 wd:Q52557383 wd:Q54958933 wd:Q2896740 wd:Q18712923 wd:Q52609680 wd:Q52609536 wd:Q52636154 wd:Q52609215 wd:Q80796 wd:Q52636535 wd:Q52633727 wd:Q52636944 wd:Q63254434 wd:Q52637577 wd:Q52665969 wd:Q52660711 wd:Q52659576 wd:Q56979398 wd:Q52670242 wd:Q29891111 wd:Q63254475 wd:Q52619294 wd:Q52662151 wd:Q7072722 wd:Q52609375 wd:Q7259709 wd:Q52636843 wd:Q45251004 wd:Q52637573 wd:Q52662489 wd:Q52635330 wd:Q47116994 wd:Q30267116 wd:Q24706265 wd:Q52620720 wd:Q52633876 wd:Q56416796 wd:Q52660351 wd:Q52635690 wd:Q7433770 wd:Q27991304 wd:Q55566796 wd:Q52619286 wd:Q30265175 wd:Q8035326 }
}
}
MINUS {
SERVICE <http://sciencesource-query.wmflabs.org/proxy/wdqs/bigdata/namespace/wdq/sparql> {
?articleitem sst:P3 ss:Q4;
sst:P2 ?stritem.
}
BIND(substr(str(?item),32,39) AS ?stritem) .}
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
ORDER BY DESC(?date)
LIMIT 1000
- ↑
SELECT DISTINCT ?item ?itemLabel ?pmcid ?journalLabel ?title ?date ?licenseLabel ?mainsubjectLabel
WHERE {
?item wdt:P31 wd:Q7318358;
wdt:P5008 wd:Q55439927;
wdt:P932 ?pmcid;
wdt:P1433 ?journal;
wdt:P1476 ?title;
wdt:P577 ?date;
wdt:P275 ?license;
wdt:P921 ?mainsubject.
?mainsubject wdt:P1995 wd:Q788926.
MINUS {?item wdt:P275 wd:Q36795408} #Remove these ND licences
MINUS {?item wdt:P275 wd:Q6937225}
MINUS {?item wdt:P275 wd:Q19125045}
MINUS {?item wdt:P275 wd:Q24082749}
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
ORDER BY DESC(?date)
OFFSET 1000
LIMIT 500
- ↑
#Compound SS ingest query, with checking of presence of paper on the SS wiki.
#Reverse chronological order used for selection.
PREFIX ss: <http://sciencesource.wmflabs.org/entity/>
PREFIX sst: <http://sciencesource.wmflabs.org/prop/direct/>
SELECT DISTINCT ?item ?itemLabel ?pmcid ?journalLabel ?title ?date ?licenseLabel ?mainsubjectLabel
WHERE {
?item wdt:P31 wd:Q7318358;
wdt:P5008 wd:Q55439927;
wdt:P932 ?pmcid;
wdt:P1433 ?journal;
wdt:P1476 ?title;
wdt:P577 ?date;
wdt:P275 ?license;
wdt:P921 ?mainsubject.
?mainsubject wdt:P1995 ?spec.
?spec wdt:P361* wd:Q162606.
MINUS {?item wdt:P275 wd:Q36795408} #Remove these ND licences
MINUS {?item wdt:P275 wd:Q6937225}
MINUS {?item wdt:P275 wd:Q19125045}
MINUS {?item wdt:P275 wd:Q24082749}
MINUS {
SERVICE <http://sciencesource-query.wmflabs.org/proxy/wdqs/bigdata/namespace/wdq/sparql> {
?articleitem sst:P3 ss:Q4;
sst:P2 ?stritem.
}
BIND(substr(str(?item),32,39) AS ?stritem) .}
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
ORDER BY DESC(?date)
LIMIT 500