Wikidata:WikiFactMine/ScienceSource

ScienceSource is a proposed project of ContentMine, and part of the WikiFactMine initiative.
This is an old draft document, of no particular status: see m:Grants:Project/ScienceSource for the grant proposal.

The aim of the ScienceSource project is to apply existing and new techniques to literature search, for and with the help of Wikidata. It would operate both at the "micro" level of fact mining of scientific literature, and by making appropriate use of source metadata.

The problem to solve is the gathering of information in and around Wikidata, and its use in creating new processes and workflows to deal with the issues of the biomedical literature in particular. Via infoboxes, Wikidata statements are used across hundreds of Wikipedia editions in different languages. By these means, more rigorous scrutiny and an audit trail for referencing would be applied to biomedical content within Wikimedia.

Source repository

The project is based on a new ScienceSource repository, which would be a MediaWiki site. The full content of selected open access papers would be downloaded there. The emphasis would be on biomedical papers, with special attention to reviews; WikiJournal papers within scope should also be included. Added value would take the form of an annotation layer, which would include entity identification in WikiFactMine style, by means of custom dictionaries, and semantic identification of relational content. Further types of annotation and metadata would be sought, in particular via onwiki community input.
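As an illustration of what dictionary-based entity identification could look like, here is a minimal sketch. It assumes a custom dictionary that maps surface terms to item identifiers; the dictionary contents, the placeholder IDs and the `annotate` function are hypothetical, and this is not the WikiFactMine implementation.

```python
import re

# Hypothetical custom dictionary: surface term -> item identifier.
# The identifiers are placeholders for illustration, not real Wikidata IDs.
DICTIONARY = {
    "aspirin": "Q_ASPIRIN",
    "myocardial infarction": "Q_MI",
}


def annotate(text, dictionary):
    """Return one annotation per dictionary match, sorted by position."""
    annotations = []
    for term, item_id in dictionary.items():
        # Whole-word, case-insensitive match of the dictionary term.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            annotations.append({
                "start": match.start(),
                "end": match.end(),
                "text": match.group(0),
                "item": item_id,
            })
    return sorted(annotations, key=lambda a: a["start"])


sample = "Aspirin reduces the risk of myocardial infarction."
annotations = annotate(sample, DICTIONARY)
for a in annotations:
    print(a)
```

Each annotation records character offsets, so it could in principle be stored as a stand-off layer over the downloaded text rather than as a change to the text itself.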

Software development

Software features required for the project include:

  • Downloader for papers, drawing on multiple repositories (basis being the existing ContentMine quickscrape tool)
  • Annotation client, intended to be derived from the open source Hypothesis client
  • Export and visualisation tools.

The WikiFactMine approach to co-occurrence of terms in papers would also be refined, for example by using a two-pass method.
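The proposal does not specify what the two-pass method would be. One possible reading, sketched here purely as an assumption, is that a first pass selects papers mentioning at least two dictionary terms anywhere, and a second pass counts only those pairs of terms that occur within the same sentence. The function name and sample texts are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations


def two_pass_cooccurrence(papers, terms):
    """Count sentence-level term pairs in papers that pass a coarse first filter."""
    terms = [t.lower() for t in terms]

    # Pass 1 (coarse): keep papers mentioning at least two terms anywhere.
    candidates = [
        p for p in papers
        if sum(t in p.lower() for t in terms) >= 2
    ]

    # Pass 2 (fine): count pairs of terms that share a sentence.
    counts = Counter()
    for paper in candidates:
        for sentence in re.split(r"(?<=[.!?])\s+", paper.lower()):
            present = sorted(t for t in terms if t in sentence)
            counts.update(combinations(present, 2))
    return counts


papers = [
    "Zika virus was detected. Microcephaly was observed in some cases.",
    "Zika virus infection is associated with microcephaly.",
    "Dengue is common.",
]
counts = two_pass_cooccurrence(papers, ["zika", "microcephaly", "dengue"])
print(counts)
```

In this toy example the first paper passes the coarse filter but contributes no pair, because the two terms appear in different sentences. Substring matching is used for brevity; a real tool would need proper tokenisation.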

Data import

Advantage would be taken of the existence of metadata and tagging on papers, for example in the PubMed family of sites. Where there are no licensing restrictions, such data can be added to the existing holdings on Wikidata, on the items devoted to scientific articles, which now (as of January 2018) number over 12 million.

Data modelling, export and reuse

ScienceSource would have as a main objective to adapt the MEDRS sources guideline from the English Wikipedia, so that more cases could be handled by automation. While work in this direction is likely to result in good filtering of sources into "high quality" and "unacceptable", there is also going to be an intermediate "grey area" where the project should work with community experts. Holding the results of discussions as a body of case studies, largely formalised in annotations of standard types, should capture expert insights.
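A three-way triage of this kind could be sketched as below. The rules are toy assumptions chosen only to show the shape of the filter; they are not the MEDRS guideline, and the metadata fields (`retracted`, `publication_type`, `peer_reviewed`) are hypothetical.

```python
def triage(source):
    """Sort a source into one of three bins using illustrative rules only."""
    # Assumed rule: a retracted paper is never acceptable.
    if source.get("retracted"):
        return "unacceptable"
    # Assumed rule: a peer-reviewed review article is accepted automatically.
    if source.get("publication_type") == "review" and source.get("peer_reviewed"):
        return "high quality"
    # Everything else is left for discussion with community experts.
    return "grey area"


sources = [
    {"id": "paper-1", "publication_type": "review", "peer_reviewed": True},
    {"id": "paper-2", "publication_type": "case report", "peer_reviewed": True},
    {"id": "paper-3", "publication_type": "review", "retracted": True},
]
results = {s["id"]: triage(s) for s in sources}
print(results)
```

The design point is that the automated rules decide only the clear cases, and the default branch routes everything else to human review.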

Where triple-like facts can be extracted from papers by fact mining techniques, and the papers pass the tests applied to filter sources for referencing, then the facts can be used to populate Wikidata items (or to provide better references for existing facts). The project would work on automating this route.
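The route from mined fact to referenced statement might look roughly like this sketch. The `Fact` class, the `to_statement` function and all item and property identifiers other than P248 ("stated in", the Wikidata property commonly used in references) are placeholders introduced for illustration; nothing here writes to Wikidata.

```python
from dataclasses import dataclass


@dataclass
class Fact:
    subject: str  # item ID of the subject (placeholder)
    prop: str     # property ID of the relation (placeholder)
    value: str    # item ID of the object (placeholder)
    source: str   # item ID of the paper the fact was mined from


def to_statement(fact, source_ok):
    """Build a referenced statement, or return None if the source fails the filter."""
    if not source_ok(fact.source):
        return None
    return {
        "subject": fact.subject,
        "property": fact.prop,
        "value": fact.value,
        # P248 is "stated in"; the reference points back to the paper.
        "references": [{"P248": fact.source}],
    }


# Hypothetical set of papers that passed the source filter.
ACCEPTED = {"Q_PAPER_1"}


def accepted(paper_id):
    return paper_id in ACCEPTED


statement = to_statement(Fact("Q_DRUG", "P_TREATS", "Q_DISEASE", "Q_PAPER_1"), accepted)
rejected = to_statement(Fact("Q_DRUG", "P_TREATS", "Q_DISEASE", "Q_PAPER_2"), accepted)
print(statement)
print(rejected)
```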

Scientific papers also contain tabulated data, and images of different kinds (figures, graphs, maps, photographs ...). Added value markup for both these classes is possible, with the table namespace on Wikimedia Commons, and the data model currently being developed for structured data there.

There would also be an application to visualisation of citation graphs. At present, simple visualisations (e.g. for papers citing papers citing the Crick-Watson paper on DNA) can be too naive to be useful. Value can be added to the graphical representation of which papers cite which, using filtering, where more is known about individual papers. This direction leads to more refined forms of "pyramid search": the study of one article by means of all the papers it cites.
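A small sketch of both ideas, under stated assumptions: the citation graph is held as a plain mapping from each paper to the papers it cites, and the paper labels, the `PAPER_TYPE` metadata and the function names are all hypothetical.

```python
# Toy citation graph: paper -> papers it cites.
CITES = {
    "A": ["B", "C", "D"],
    "B": ["D"],
    "C": [],
    "D": [],
}

# Hypothetical per-paper metadata used for filtering.
PAPER_TYPE = {"A": "review", "B": "review", "C": "primary", "D": "primary"}


def pyramid(article, cites, keep=lambda paper: True):
    """Papers cited by one article, optionally filtered by what is known about them."""
    return [p for p in cites.get(article, []) if keep(p)]


def citing(target, cites):
    """Papers that cite the target directly."""
    return sorted(p for p, refs in cites.items() if target in refs)


def citing_at_depth(target, cites, depth):
    """Papers reached by following 'is cited by' links exactly `depth` times."""
    frontier = {target}
    for _ in range(depth):
        frontier = {p for t in frontier for p in citing(t, cites)}
    return sorted(frontier)


reviews_cited_by_a = pyramid("A", CITES, keep=lambda p: PAPER_TYPE[p] == "review")
second_order = citing_at_depth("D", CITES, 2)
print(reviews_cited_by_a)
print(second_order)
```

Here `citing_at_depth` with a depth of 2 corresponds to "papers citing papers citing" a target, and the `keep` argument is where metadata about individual papers would prune the graph before it is drawn.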

Community

For this project, WikiFactMine would sustain its program of community engagement, which in 2017 included a monthly mass message (Facto Post), meetups and training sessions. It would again attend the Wikimania and WikiCite conferences, and would support an event to mark the sixth anniversary of Wikidata.