Wikidata:ScienceSource project/ScienceSourceIngest notebook

This page for the ScienceSource project covers the use of the SSIngest tool to create content on the ScienceSource wiki.

See:

Trials

Sample disease query
#Diseases within sleep medicine
SELECT DISTINCT ?item
  WHERE {
  ?item wdt:P31 wd:Q12136 .
  ?item wdt:P1995 ?medspec .
  ?medspec wdt:P361* wd:Q744029 .
  
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
LIMIT 10000
Sample drug query
#Drugs used as treatment in sleep medicine
SELECT DISTINCT ?item 
  WHERE {
  ?item wdt:P31 wd:Q12140 .
  ?item wdt:P2175 ?condition  .
  ?condition wdt:P1995 wd:Q744029 .
  
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
LIMIT 5000

Queries of these types can be converted into dictionaries using the aaraa tool.
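As a rough illustration of the query-to-dictionary step, the sketch below turns a SPARQL JSON result set into a flat list of term entries. It is only a sketch under stated assumptions: the output format of the aaraa tool is not described on this page, so the `wikidata`/`term` entry layout and the function name `bindings_to_entries` are hypothetical. It also assumes the query selects `?itemLabel` alongside `?item`; the sample queries above select only `?item`, in which case the term comes back empty.

```python
import json
from typing import Dict, List


def bindings_to_entries(results_json: str) -> List[Dict[str, str]]:
    """Convert a SPARQL JSON result set into simple dictionary entries.

    The entry layout is an assumption for illustration, not the aaraa format.
    """
    data = json.loads(results_json)
    entries = []
    for row in data["results"]["bindings"]:
        uri = row["item"]["value"]
        # The QID is the last path segment of the entity URI.
        qid = uri.rsplit("/", 1)[-1]
        # ?itemLabel is optional; fall back to an empty term if not selected.
        label = row.get("itemLabel", {}).get("value", "")
        entries.append({"wikidata": qid, "term": label})
    return entries
```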

Sample ingest query
SELECT DISTINCT ?item ?itemLabel ?pmcid ?journalLabel ?title ?date ?licenseLabel ?mainsubjectLabel
WHERE {
         ?item wdt:P31 wd:Q7318358;
               wdt:P5008 wd:Q55439927;
               wdt:P932 ?pmcid;
               wdt:P1433 ?journal;
               wdt:P1476 ?title;
               wdt:P577 ?date;
               wdt:P275 ?license;
               wdt:P921 ?mainsubject.           
         ?mainsubject wdt:P1995 wd:Q744029.
         MINUS {?item wdt:P275 wd:Q36795408} #Remove these ND licences
         MINUS {?item wdt:P275 wd:Q6937225}
         MINUS {?item wdt:P275 wd:Q19125045}
         MINUS {?item wdt:P275 wd:Q24082749}
         MINUS {?item wdt:P275 wd:Q6936496} #Remove NC license

        SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }

        }
 LIMIT 300
Sample ingest query for offset/limit batch
SELECT DISTINCT ?item ?itemLabel ?pmcid ?journalLabel ?title ?date ?licenseLabel ?mainsubjectLabel
WHERE {
         ?item wdt:P31 wd:Q7318358;
               wdt:P5008 wd:Q55439927;
               wdt:P932 ?pmcid;
               wdt:P1433 ?journal;
               wdt:P1476 ?title;
               wdt:P577 ?date;
               wdt:P275 ?license;
               wdt:P921 ?mainsubject.           
         ?mainsubject wdt:P1995 wd:Q788926.
         MINUS {?item wdt:P275 wd:Q36795408} #Remove these ND licences
         MINUS {?item wdt:P275 wd:Q6937225}
         MINUS {?item wdt:P275 wd:Q19125045}
         MINUS {?item wdt:P275 wd:Q24082749}
        
        SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }

        }
ORDER BY
 DESC(?date)
OFFSET 1000
LIMIT 500
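The offset/limit pattern in this query lends itself to scripting. The following is a minimal sketch, assuming the total number of matching papers is known in advance; the helper names `batch_windows` and `render_tail` and the `QUERY_TAIL` template are hypothetical and are not part of SSIngest.

```python
from typing import Iterator, Tuple

# Hypothetical tail of an ingest query; the SELECT/WHERE part is omitted.
QUERY_TAIL = "ORDER BY DESC(?date)\nOFFSET {offset}\nLIMIT {limit}"


def batch_windows(total: int, batch: int, start: int = 0) -> Iterator[Tuple[int, int]]:
    """Yield (offset, limit) pairs covering result rows start..total-1."""
    if batch <= 0:
        raise ValueError("batch must be positive")
    offset = start
    while offset < total:
        # The last window is shortened so it does not run past the end.
        yield offset, min(batch, total - offset)
        offset += batch


def render_tail(offset: int, limit: int) -> str:
    """Fill the offset and limit into the query tail."""
    return QUERY_TAIL.format(offset=offset, limit=limit)
```

The `ORDER BY` clause matters here: without a fixed ordering, successive OFFSET windows are not guaranteed to partition the result set. Even with it, the windows can shift between runs if the underlying Wikidata statements change.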

Licensing

#Filter for most common forms of CC0, CC-by and CC-by-SA licenses
SELECT DISTINCT ?paper
WHERE 
      {
      ?paper wdt:P31 wd:Q13442814;
             wdt:P31 wd:Q7318358;
             wdt:P921 ?subject;
             wdt:P1476 ?title.
       #Licenses are matched with P275 (copyright license), as in the ingest queries
       {?paper wdt:P275 wd:Q6938433 } #CC0
        UNION {?paper wdt:P275 wd:Q6905323} UNION {?paper wdt:P275 wd:Q20007257} 
        UNION {?paper wdt:P275 wd:Q14947546} UNION {?paper wdt:P275 wd:Q18810333} UNION {?paper wdt:P275 wd:Q19125117} 
        ##CC-by, CC-by 4.0, CC-by 3.0, CC-by 2.5, CC-by 2.0
        UNION {?paper wdt:P275 wd:Q6905942} UNION {?paper wdt:P275 wd:Q18199165} 
        UNION {?paper wdt:P275 wd:Q14946043} UNION {?paper wdt:P275 wd:Q19113751} UNION {?paper wdt:P275 wd:Q19068220} 
        ##CC-by-SA, CC-by-SA 4.0, CC-by-SA 3.0, CC-by-SA 2.5, CC-by-SA 2.0    
      ?subject wdt:P1995 ?spec.
      ?spec wdt:P361* wd:Q788926
      }

Production runs

Neurology

Query as for oncology.

Date   Started (UTC)  Ended (UTC)  Papers (after de-duplication)  Offset  Limit
05-12  0801           1354         383                            0       500
05-12  2051           0901         711                            500     1000
05-13  0909           1429         361                            1500    500
05-13  1449           0258         702                            2000    1000
05-14  0345           1306         677                            3000    1000
05-14  1512           2346         704                            4000    1000
05-15  0540           1519         657                            5000    1000
05-15  1646           0109         642                            6000    1000, to EOF

Medical genetics

Query as for oncology.

Run  Started (UTC)  Ended (UTC)  Papers (after de-duplication)  Offset  Limit
1    1607           2146         722                            0       722
2    1748           ?            289                            1000    to EOF

Oncology

  • Baseline ingest query[1]
Date   Started (UTC)  Ended (UTC)   Papers (after de-duplication)  Offset  Limit  Notes
04-29  0859           1544 (04-30)  694                            0       1000   Terminated; "cancer" as dictionary term causing too many hits
05-03  0814           0306          668                            1010    990    Dictionary with "cancer" removed; trial run with 10 made first
05-06  1804           0154          331                            2000    500
05-07  0514           1118          344                            2500    500    Terminated by connection failure
05-07  1518           2030          253                            3000    350
05-07  2202           0819          474                            3350    650
05-08  0855           0946          351                            4000    500    Terminated by connection failure
05-08  1151           1907          355                            4500    500
05-08  2120           0951          333                            5000    500
05-09  0811           1455          336                            5500    500
05-09  1531           0302          613                            6000    1000
05-10  0552           1425          707                            7000    1000
05-10  1955           0956          661                            8000    1000

Infectious Disease

Offset Limit Run started Papers to process Run terminated Residual, start Residual, end Article items, start Article items, end Comments
0 500 2019-04-01
16:22
346 16:31, terminated 6635 430 feedinfectiousdisease0.json
Nearly all papers had a message like "Failed to process paper 6348913: Failed to upload paper: invalid character '<' looking for beginning of value"; just a few seemed to get to the "reconciling" stage.
Nothing was written to the SS wiki.
Rerun on 15 April, "papers found" was 346, ran through 0913 to 1558 with some individual papers failing. Residual query total for this specialty on 16 April was 6631.

Records for runs starting with feedinfectiousdisease2.json.

Date   Started (UTC)  Ended (UTC)  Papers (after de-duplication)  Offset (default batch 500)
04-16  2109           0301         374                            500
04-17  0537           1123         372                            1000
04-17  1251           1827         364                            1500
04-17  2130           0252         341                            2000
04-18  0453           0953         359                            2500
04-18  1244           1726         355                            3000
04-18  2050           0127         384                            3500
04-19  0514           1120         349                            4000
04-19  1140           1439         385                            4500
04-19  1528           2211         363                            5000
04-21  0343           0945         344                            5500
04-21  0954           1515         358                            6000
04-21  1536           2011         330                            6500
04-21  2111           0326         353                            7000
04-22  0336           0753         309                            7500
04-22  2002           355          355                            8000
04-22  2117           0317         305                            8600
04-23  0503           0545         71 (from 100)                  8500
04-23  0614           0657         63 (to EOF)                    9100

The nominal total comes to 6106, against a specialty count of 5875, which implies some double counting. This could be papers that failed after the de-duplication stage and were then picked up again in later batches.
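One way to test the double-counting explanation is to look for paper identifiers that occur in more than one batch. The following is a minimal sketch, assuming the PMCIDs processed in each batch have been collected into lists; the function name `repeated_ids` and the sample identifiers are hypothetical and are not real run data.

```python
from typing import Dict, Iterable, List


def repeated_ids(batches: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """Map each ID seen in more than one batch to the batches containing it."""
    seen: Dict[str, List[str]] = {}
    for name, ids in batches.items():
        for pmcid in set(ids):  # ignore repeats inside a single batch
            seen.setdefault(pmcid, []).append(name)
    return {pmcid: names for pmcid, names in seen.items() if len(names) > 1}


# Illustrative data only; these are not real run records.
example = {
    "batch_a": ["PMC1", "PMC2", "PMC3"],
    "batch_b": ["PMC3", "PMC4"],
    "batch_c": ["PMC5", "PMC1"],
}
dupes = repeated_ids(example)
# Sum of per-batch counts versus the number of distinct papers.
nominal_total = sum(len(set(ids)) for ids in example.values())
distinct_total = len({pmcid for ids in example.values() for pmcid in ids})
```

The gap between `nominal_total` and `distinct_total` is the number of extra appearances, which is the quantity to compare against the difference between the nominal total and the specialty count.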

Endocrinology

  • Compound ingest query, uses federation to check for presence of papers already on SS wiki.[3]
  • "Residual query": on 16 April, before any ingestion, there were 4347 items on the focus list that had a main subject within endocrinology or one of its subspecialties and did not match an article item on the SS wiki.
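The residual count amounts to a set difference between the focus list and the articles already on the SS wiki. The sketch below mirrors the `BIND(substr(str(?item),32,39) AS ?stritem)` step of the compound query, which strips the 31-character Wikidata entity prefix to leave the QID. It assumes that the `sst:P2` value on the SS wiki holds that QID as a plain string, as the query suggests; the function names `qid_from_uri` and `residual` are hypothetical.

```python
from typing import Iterable, List

WD_ENTITY_PREFIX = "http://www.wikidata.org/entity/"  # 31 characters


def qid_from_uri(uri: str) -> str:
    """Return the trailing QID, like substr(str(?item), 32, 39) in the query."""
    if not uri.startswith(WD_ENTITY_PREFIX):
        raise ValueError(f"not a Wikidata entity URI: {uri}")
    # SPARQL substr is 1-based, so position 32 is Python index 31.
    return uri[len(WD_ENTITY_PREFIX):]


def residual(focus_uris: Iterable[str], ss_qids: Iterable[str]) -> List[str]:
    """Focus-list items whose QID is not yet recorded on the SS wiki."""
    present = set(ss_qids)
    return [uri for uri in focus_uris if qid_from_uri(uri) not in present]
```

The SPARQL version caps the extracted string at 39 characters, whereas the slice here takes everything after the prefix; for real QIDs the two give the same result.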
Date   Started (UTC)  Ended (UTC)  Papers (after de-duplication)  Offset (default batch 500)  Notes
04-16  0913           1508         399                            0
04-16  0531           2032         369                            500
04-16  2132           0236         354                            1000
04-17  0638           1136         349                            1500
04-17  1306           1841         353                            2000
04-23  0837           0851         309                            2500                        Incorrect dictionaries used, ran through quickly
04-23  0918           1316         344                            3000                        Failure messages at end
04-24  0622           0626         10                             3990, limit 10              Test run to check for failures, ran through OK
04-24  0633           0953         343                            3500 (suffix 8)
04-24  1100           1831         371                            4000
04-25  1639           2254         385                            4500
04-26  0511           ?            336                            5000
04-26  1216           1846         468                            5500, limit 1000            EOF

Cardiology

Query as for oncology.

Date   Started (UTC)  Ended (UTC)  Papers (after de-duplication)  Offset  Limit
05-16  1521           0238         1615                           0       1500

Gastroenterology

Query as for oncology.

Date   Started (UTC)  Ended (UTC)  Papers (after de-duplication)  Offset  Limit
05-16  0344           1112         542                            0       750
05-16  1143           1419         204                            750     250

Notes

  1. #Compound ScienceSource ingest query
    #With checking of presence of paper on the SS wiki.
    #Checks for publishers found on Beall's list (final version, as found on Wikidata)
    #Removes "no derivatives" Creative Commons licenses
    #Reverse chronological order used for selection.
    
    PREFIX ss: <http://sciencesource.wmflabs.org/entity/>
    PREFIX sst: <http://sciencesource.wmflabs.org/prop/direct/>
    
    SELECT DISTINCT ?item ?itemLabel ?pmcid ?journalLabel ?title ?date ?licenseLabel ?mainsubjectLabel
    WHERE {
             ?item wdt:P31 wd:Q7318358;
                   wdt:P5008 wd:Q55439927;
                   wdt:P932 ?pmcid;
                   wdt:P1433 ?journal;
                   wdt:P1476 ?title;
                   wdt:P577 ?date;
                   wdt:P275 ?license;
                   wdt:P921 ?mainsubject. 
      ?mainsubject wdt:P1995 ?spec.
              ?spec wdt:P361* wd:Q162555.
             MINUS {?item wdt:P275 wd:Q36795408} #Remove these ND licences
             MINUS {?item wdt:P275 wd:Q6937225}
             MINUS {?item wdt:P275 wd:Q19125045}
             MINUS {?item wdt:P275 wd:Q24082749}
      
      MINUS {
             ?item wdt:P5008 wd:Q55439927;
                   wdt:P1433 ?journal.
             ?journal wdt:P123 ?publisher.
        
              {
               VALUES ?publisher
     
              {wd:Q52636754 wd:Q52635805 wd:Q4689899 wd:Q52620137 wd:Q4732612 
               wd:Q43080819 wd:Q30270870 wd:Q30297686 wd:Q52661346 wd:Q52636079 
               wd:Q52557383 wd:Q54958933 wd:Q2896740 wd:Q18712923 wd:Q52609680 
               wd:Q52609536 wd:Q52636154 wd:Q52609215 wd:Q80796 wd:Q52636535 
               wd:Q52633727 wd:Q52636944 wd:Q63254434 wd:Q52637577 wd:Q52665969 wd:Q52660711
               wd:Q52659576 wd:Q56979398 wd:Q52670242 wd:Q29891111 wd:Q63254475 wd:Q52619294 
               wd:Q52662151 wd:Q7072722 wd:Q52609375 wd:Q7259709 wd:Q52636843 
               wd:Q45251004 wd:Q52637573 wd:Q52662489 wd:Q52635330 wd:Q47116994 
               wd:Q30267116 wd:Q24706265 wd:Q52620720 wd:Q52633876 wd:Q56416796 
               wd:Q52660351 wd:Q52635690 wd:Q7433770 wd:Q27991304 wd:Q55566796 
               wd:Q52619286 wd:Q30265175 wd:Q8035326 
               } 
            } }    
      
      MINUS { SERVICE <http://sciencesource-query.wmflabs.org/proxy/wdqs/bigdata/namespace/wdq/sparql>
           { ?articleitem sst:P3 ss:Q4;
                sst:P2 ?stritem. }
    
              BIND(substr(str(?item),32,39) AS ?stritem) .}
            
            SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
    
            }
    ORDER BY
     DESC(?date)
    LIMIT 1000
    
  2. SELECT DISTINCT ?item ?itemLabel ?pmcid ?journalLabel ?title ?date ?licenseLabel ?mainsubjectLabel
    WHERE {
             ?item wdt:P31 wd:Q7318358;
                   wdt:P5008 wd:Q55439927;
                   wdt:P932 ?pmcid;
                   wdt:P1433 ?journal;
                   wdt:P1476 ?title;
                   wdt:P577 ?date;
                   wdt:P275 ?license;
                   wdt:P921 ?mainsubject.           
             ?mainsubject wdt:P1995 wd:Q788926.
             MINUS {?item wdt:P275 wd:Q36795408} #Remove these ND licences
             MINUS {?item wdt:P275 wd:Q6937225}
             MINUS {?item wdt:P275 wd:Q19125045}
             MINUS {?item wdt:P275 wd:Q24082749}
            
            SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
    
            }
    ORDER BY
     DESC(?date)
    OFFSET 1000
    LIMIT 500
    
  3. #Compound SS ingest query, with checking of presence of paper on the SS wiki. 
    #Reverse chronological order used for selection.
    
    PREFIX ss: <http://sciencesource.wmflabs.org/entity/>
    PREFIX sst: <http://sciencesource.wmflabs.org/prop/direct/>
    
    SELECT DISTINCT ?item ?itemLabel ?pmcid ?journalLabel ?title ?date ?licenseLabel ?mainsubjectLabel
    WHERE {
             ?item wdt:P31 wd:Q7318358;
                   wdt:P5008 wd:Q55439927;
                   wdt:P932 ?pmcid;
                   wdt:P1433 ?journal;
                   wdt:P1476 ?title;
                   wdt:P577 ?date;
                   wdt:P275 ?license;
                   wdt:P921 ?mainsubject. 
      ?mainsubject wdt:P1995 ?spec.
              ?spec wdt:P361* wd:Q162606.
             MINUS {?item wdt:P275 wd:Q36795408} #Remove these ND licences
             MINUS {?item wdt:P275 wd:Q6937225}
             MINUS {?item wdt:P275 wd:Q19125045}
             MINUS {?item wdt:P275 wd:Q24082749}
      
      MINUS { SERVICE <http://sciencesource-query.wmflabs.org/proxy/wdqs/bigdata/namespace/wdq/sparql>
           { ?articleitem sst:P3 ss:Q4;
                sst:P2 ?stritem. }
    
              BIND(substr(str(?item),32,39) AS ?stritem) .}
            
            SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
    
            }
    ORDER BY
     DESC(?date)
    LIMIT 500
    