User:Charles Matthews/ContentMine workshop 15 December 2018

Programme for [ContentMine workshop] 15 December 2018, Makespace classroom, 16 Mill Lane, Cambridge; see m:Meetup/Cambridge/39.


Introductory videos:

Number Video link Topic
1 Introduction to the project
2 ScienceSource focus list
3 What's a neglected disease?
4 ScienceSource annotations

Control content

"In the beginning was the dictionary"

ContentMine dictionaries are

  • lists of search terms like "leukemia",
  • paired with Wikidata items such as leukemia (Q29496), and
  • usually extracted from Wikidata by a query.
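As a rough illustration of the three points above, a dictionary can be pictured as a list of pairs, each holding a search term and a Wikidata item. The sketch below assumes that shape only; it is not the actual ContentMine file format, and the names `DictionaryEntry` and `find_terms` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DictionaryEntry:
    term: str         # search term, e.g. "leukemia"
    wikidata_id: str  # Q-number of the paired Wikidata item


# One entry taken from the text; a real dictionary would usually be
# extracted from Wikidata by a query and hold many entries.
DICTIONARY = [DictionaryEntry("leukemia", "Q29496")]


def find_terms(text, dictionary):
    """Return the entries whose term occurs in the text.

    Matching is a case-insensitive substring test, a simplifying assumption.
    """
    lowered = text.lower()
    return [entry for entry in dictionary if entry.term.lower() in lowered]


hits = find_terms("Acute leukemia was diagnosed in two patients.", DICTIONARY)
print([(entry.term, entry.wikidata_id) for entry in hits])
```

Pairing each term with a Q-number means a hit in a text can be tied back to a single Wikidata item rather than to a bare string.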



Wikidata is Wikimedia's knowledge base: multilingual, and part of the Semantic Web family of machine-readable sites. It is also illustrated with several million images.

Our interests are mainly in drugs, diseases and scientific papers, which are not the best topics to show off the illustrations. But here is something in the drug field:

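#ImageGrid for compounds "-sterone".
SELECT DISTINCT ?item ?itemLabel ?pic
WHERE {
	# Chemical compounds (Q11173) with their labels
	?item wdt:P31 wd:Q11173;
	      rdfs:label ?itemLabel.
	# Keep English labels that end in "sterone"
	FILTER (lang(?itemLabel) = "en")
	FILTER regex (?itemLabel, "(sterone)$")
	# Image (P18), where one exists
	OPTIONAL { ?item wdt:P18 ?pic }
}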

Try it!

So what is that?


SPARQL is the query language common to Wikidata and other Semantic Web sites.

First activity: run this query.

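#ImageGrid for taxon
SELECT DISTINCT ?item ?itemLabel ?pic
WHERE {
	# Taxa (Q16521) that descend from Q21860 through any chain of parent taxon (P171)
	?item wdt:P31 wd:Q16521;
	      wdt:P171* wd:Q21860.
	# Image (P18), where one exists
	OPTIONAL { ?item wdt:P18 ?pic }
}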

Try it!

Then replace Q21860 with the number on your card, and run it again.

Content close-ups


Roughly speaking, data that can be used for cataloguing purposes.


Layers of commentary about a given text, made up of comments pointing to places in it, directly or indirectly.


Two search terms found in the same text. To see examples:

lists sample article texts, alphabetically. Given the Q-number, navigate to the item using the browser address line, with Item:Q... . Use "What links here" to find anchor points. From the anchor point statements, find the annotations. Then locate the terms in the actual text.
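The idea of two search terms found in the same text can also be sketched in code. The example below is a minimal illustration only: the function name `cooccurring_pairs` is hypothetical, the two sample texts are invented for the example, and matching is a plain case-insensitive substring test rather than ScienceSource's annotation mechanism.

```python
from itertools import combinations


def cooccurring_pairs(texts, terms):
    """Map each pair of terms to the names of the texts containing both."""
    pairs = {}
    for name, body in texts.items():
        lowered = body.lower()
        # Terms present in this text, sorted so each pair has a stable order
        present = sorted(t for t in terms if t.lower() in lowered)
        for first, second in combinations(present, 2):
            pairs.setdefault((first, second), []).append(name)
    return pairs


# Invented sample texts, for illustration only
texts = {
    "text1": "Leukemia patients were treated with prednisone.",
    "text2": "Prednisone dosage was reviewed.",
}
result = cooccurring_pairs(texts, ["leukemia", "prednisone"])
print(result)
```

Only text1 contains both terms, so it is the only text recorded against the pair.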

Assess content

The ScienceSource project wiki will apply SPARQL to explore uploaded papers in a number of ways. User:Charles Matthews/ScienceSource queries is a warehouse of queries, in a readable form.


Infographics that represent where search term annotations lie, when you divide up a text into a number of parts.
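The numbers behind such an infographic could be produced along the following lines. This is a sketch under stated assumptions: the text is split into roughly equal slices by word position, the term is a single word, and the function name `hits_per_part` is hypothetical rather than part of ScienceSource.

```python
def hits_per_part(text, term, parts=4):
    """Count occurrences of a single-word term in each slice of the text.

    The text is divided into `parts` roughly equal slices by word position.
    """
    words = text.lower().split()
    counts = [0] * parts
    for index, word in enumerate(words):
        if term.lower() in word:
            # Integer arithmetic maps word position to a slice number
            counts[index * parts // len(words)] += 1
    return counts


sample = "malaria a b c d e f malaria"
print(hits_per_part(sample, "malaria"))
```

Each count can then be drawn as one cell, with darker cells for the parts of the text where the term clusters.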


TF-IDF is a formula used to rank search term annotations in a corpus of texts.

A table like this, but on a larger scale, can record how many hits for each dictionary term in a corpus of texts.

text term1 term2 term3 term4 term5
text1 3 0 0 4 1
text2 5 1 2 0 2
text3 0 0 0 2 0

The idea is to scale each row with a factor (TF) that takes into account the length of the text, and each column by a factor (IDF) that varies inversely with the number of non-zero entries. Then all the entries are ranked by "interestingness".

With some caveats, SPARQL can carry out this ranking within ScienceSource, on batches of articles.


Guideline used on Wikimedia about "medical reliable sources", in order to decide which citations are acceptable for health information.

A subproblem is to exclude papers from "predatory publishers". Third activity: hunt the predator!

#Journals without publishers on ScienceSource focus list
SELECT DISTINCT ?journal ?journalLabel
  WHERE {?item wdt:P5008 wd:Q55439927;
               wdt:P1433 ?journal.
         MINUS {?journal wdt:P123 ?publisher}
         SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
        }

Try it!

Auxiliary query: replace Q5506062 with the Q-number of the journal.

#Query to check for articles published in a given journal.
SELECT ?article ?articleLabel
 WHERE {?article wdt:P1433 wd:Q5506062.
        SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
       }

Try it!