Wikidata:WikiFactMine/Annotation for fact mining

This page is a project proposal. Please help develop it by contributing on the Talk page.

WikiFactMine proposes to improve its fact-mining pipeline, in particular for Wikidata statements, by modifying the inputs and direct outputs of its system. The internal processes, such as the use of dictionaries, are not expected to change substantially.

The overall idea is to create a type of site new to Wikimedia, on which various kinds of text mining can be done. One particular proposal is being developed to show the possibilities.

Wiki platform

It is planned to set up a MediaWiki site to host the full text of open access scientific papers, in wikitext form. Realistically, these could be added at a rate of, say, 1000 per week. This content would be static, and the hosting pages would be protected.

Annotations would be robustly attached to points in the paper (or, for metadata, to a special point). They would not be stored in the text, but elsewhere, as a data structure conforming to the W3C standard for annotation.
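As an illustration of what "stored elsewhere" could look like, here is a minimal sketch in Python. It assumes that the standard meant is the W3C Web Annotation Data Model and that anchoring is done with a TextQuoteSelector (quoted text plus surrounding context). The page URL, the sample sentence, the Q-numbers Q111 and Q222, and the helper names make_annotation and locate are all hypothetical.

```python
import json


def make_annotation(page_url, exact, prefix, suffix, body_text):
    """Build a Web Annotation (JSON-LD) anchored to a quoted passage."""
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        "body": {
            "type": "TextualBody",
            "value": body_text,
            "format": "text/plain",
        },
        "target": {
            "source": page_url,
            "selector": {
                "type": "TextQuoteSelector",
                "exact": exact,
                "prefix": prefix,
                "suffix": suffix,
            },
        },
    }


def locate(text, selector):
    """Return the start offset of the quoted passage, or -1 if it is absent."""
    needle = selector["prefix"] + selector["exact"] + selector["suffix"]
    pos = text.find(needle)
    return -1 if pos < 0 else pos + len(selector["prefix"])


# Hypothetical paper text and annotation; Q111 and Q222 are placeholders.
paper_text = "Aspirin is used to treat fever and pain."
annotation = make_annotation(
    page_url="https://example.org/wiki/Paper_1",
    exact="treat fever",
    prefix="used to ",
    suffix=" and",
    body_text="Q111 P2175 Q222",
)
offset = locate(paper_text, annotation["target"]["selector"])
print(json.dumps(annotation, indent=2))
```

Because the hosted text is static and protected, a selector of this kind keeps pointing at the same passage, and the paper text itself is never modified.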

Neither the idea of a site storing the text of open access papers, nor the idea of an annotation system of this kind, is innovative in itself. The idea is to exploit these concepts by designing machine-readable annotations robust enough to serve as the data structure for the algorithms required to add value to Wikidata.

Inputs

The most desirable type of paper to load into the system is the systematic review (Q1504425) of relevance to medicine. The project would take the exact characterisation of its papers as sources seriously, by adopting and formalising w:WP:MEDRS, a Wikipedia guideline on the reliability of sources for medical (rather than purely scientific) referencing. To deliver this objective, work needs to be done on the guideline, to convert it via a data model into a decision algorithm, and on the available source metadata. This work would build on advances made this year on the WikiCite front.
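To make "data model plus decision algorithm" concrete, here is a deliberately small Python sketch. It is not a formalisation of MEDRS: the two rules (a preferred publication type and a maximum age), the field names, the five-year default and the sample dates are all assumptions chosen for illustration, and the real rules would have to come out of the work on the guideline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceMetadata:
    """Hypothetical metadata record for one hosted paper."""
    qid: str                # Wikidata item for the paper
    publication_type: str   # e.g. "systematic review"
    year: int               # year of publication


# Assumed rule set: only systematic reviews pass automatically.
PREFERRED_TYPES = {"systematic review"}


def classify(source, current_year, max_age_years=5):
    """Return "acceptable" or "needs human review" for a source."""
    if source.publication_type not in PREFERRED_TYPES:
        return "needs human review"
    if current_year - source.year > max_age_years:
        return "needs human review"
    return "acceptable"


recent_review = SourceMetadata("Q30123456", "systematic review", 2016)
case_report = SourceMetadata("Q7654321", "case report", 2016)
print(classify(recent_review, current_year=2017))
print(classify(case_report, current_year=2017))
```

Returning "needs human review" rather than a flat rejection leaves borderline sources to editors, which fits the emphasis on human approval elsewhere in this proposal.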

Outputs

Simply put, a statement written close enough to the input format of QuickStatements (Q20084080) could be "harvested" from the annotations by a bot and added to Wikidata. The proposal is that such a bot would apply an algorithm that correctly combines the relevant source metadata with human approval of the fact mining that has occurred. Adequate records would exist in the annotation to allow troubleshooting of the additions.
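The harvesting step might be sketched as follows, in Python. The annotation fields ("body", "approved_by", "source_qid"), the approval threshold and the Q-numbers Q111, Q222, Q333 and Q444 are hypothetical. The output follows the tab-separated QuickStatements version 1 command format, with S248 ("stated in") carrying the reference to the paper.

```python
import re

# A machine-readable annotation body: item, property, item.
TRIPLE = re.compile(r"^(Q\d+)\s+(P\d+)\s+(Q\d+)$")


def harvest(annotations, min_approvals=1):
    """Turn approved, sourced annotations into QuickStatements lines."""
    lines = []
    for ann in annotations:
        match = TRIPLE.match(ann.get("body", "").strip())
        if match is None:
            continue  # free-text annotation, nothing to harvest
        if len(ann.get("approved_by", [])) < min_approvals:
            continue  # no human has approved this mined fact yet
        source = ann.get("source_qid")
        if not source:
            continue  # never add a statement without its reference
        subject, prop, value = match.groups()
        lines.append("\t".join([subject, prop, value, "S248", source]))
    return lines


# Hypothetical annotations; only the first satisfies all three checks.
sample = [
    {"body": "Q111 P2175 Q222", "approved_by": ["ExampleUser"],
     "source_qid": "Q30123456"},
    {"body": "Q333 P2175 Q444", "approved_by": [],
     "source_qid": "Q30123456"},
    {"body": "An interesting passage", "approved_by": ["ExampleUser"],
     "source_qid": "Q30123456"},
]
for line in harvest(sample):
    print(line)
```

Each skipped annotation stays on the site untouched, so the record of who approved what remains available for troubleshooting.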

This kind of process would address difficulties in the current semi-automated routes for adding mined facts.

Listing biomedical properties

The application of MEDRS would be to statements involving a particular core list of Wikidata properties, such as medical condition treated (P2175). This list would have to be defined and agreed as part of the project. The requirements of particular Wikipedia infoboxes would need to be taken into account, and the algorithmic part of processing subjected to an appropriate case analysis. Detailed work may be required here.

Community

The intention is to build a versatile site, allowing for numerous styles of annotation useful for scientific literature, not just the marking up of facts with QuickStatements code. Deciding the directions of other developments (e.g. semantic markup) would be a task put in the hands of a community on the site, as would bot approvals.

Creation of WFM-specific codes

One particular function of the community would be to decide which "pseudo-properties" (surrogates for actual Wikidata properties) could be allowed in machine-readable annotations. Suppose they were A-numbers rather than P-numbers. Then code such as

Q30123456 A321 Q7654321

could represent a triple-like statement, such as could occur on Wikidata, but for which no property P yet existed on Wikidata. For example, the first Q-number might be the code for the paper in question, and the second some concrete aspect of the technique used, with A referring to the technique itself. It would be convenient to have such an auxiliary language, and in time the A-numbers could be retired if P-numbers were created to do the same job.
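A minimal Python sketch of how such an auxiliary language could be handled is given here. The function names, the RETIRED table and the target P99999999 are hypothetical placeholders: the table stands for whatever community-maintained record would say which A-numbers have been superseded by real properties.

```python
import re

# Accept either a real property (P-number) or a pseudo-property (A-number).
TRIPLE = re.compile(r"^(Q\d+)\s+([AP]\d+)\s+(Q\d+)$")

# Hypothetical retirement table: A-number -> the P-number that replaced it.
RETIRED = {"A321": "P99999999"}


def parse_triple(line):
    """Split 'Q… A…/P… Q…' into a (subject, property, value) tuple."""
    match = TRIPLE.match(line.strip())
    if match is None:
        raise ValueError(f"not a triple: {line!r}")
    return match.groups()


def resolve(triple, retired):
    """Swap a retired A-number for its P-number; leave other triples alone."""
    subject, prop, value = triple
    return (subject, retired.get(prop, prop), value)


def is_exportable(triple):
    """Only triples with a real P-number can be sent on to Wikidata."""
    return triple[1].startswith("P")


triple = parse_triple("Q30123456 A321 Q7654321")
print(triple, is_exportable(triple))
resolved = resolve(triple, RETIRED)
print(resolved, is_exportable(resolved))
```

Keeping the mapping in one table means that existing annotations need not be rewritten when an A-number is retired; they are translated at harvest time.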

In any case, once such a system is set up, community processes would be needed to debate and maintain it.

Markup for other text mining

The application of other text mining (Q676880) techniques to the hosted papers is envisaged. The details would be subject to community decisions.