Wikidata:WikidataCon 2017/Notes/An open source tool for fishing Wikidata entities in text and PDF documents

Title: An open source tool for fishing Wikidata entities in text and PDF documents

Speaker(s)

Name or username: Patrice Lopez

Useful links:

https://github.com/kermitt2/nerd

http://entity-fishing.science-miner.com

http://nerd.readthedocs.io

Abstract

entity-fishing (repo: https://github.com/kermitt2/nerd, demo: http://entity-fishing.science-miner.com, documentation: http://nerd.readthedocs.io) is an open source tool for automatically identifying and disambiguating Wikidata entities in multilingual text and PDF documents. The tool relies on machine-learning techniques that use Wikipedia as a training source. entity-fishing offers high performance and scalability and is fully generic in terms of domains and languages, so it can address a wide variety of uses. Our work focuses in particular on processing scholarly documents, taking advantage of the large amount of scientific knowledge and links present in Wikidata.
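To make the idea of "identifying and disambiguating Wikidata entities in text" more concrete, here is a minimal Python sketch of what a client of such a service might prepare and consume. It makes no network call. The helper names (`build_query`, `wikidata_entities`), the JSON field names (`text`, `language`, `entities`, `rawName`, `offsetStart`, `offsetEnd`, `wikidataId`) and the sample response are assumptions for illustration, not output from the tool; the documentation linked above describes the actual request and response format.

```python
import json


def build_query(text, lang="en"):
    """Build a query dict for a text disambiguation request.

    The field names here are assumed, not confirmed by these notes.
    """
    return {"text": text, "language": {"lang": lang}}


def wikidata_entities(response):
    """Return (surface form, Wikidata ID) pairs from a parsed response.

    Entities without a Wikidata ID are skipped.
    """
    return [
        (entity.get("rawName"), entity["wikidataId"])
        for entity in response.get("entities", [])
        if "wikidataId" in entity
    ]


text = "WikidataCon 2017 took place in Berlin."
query = build_query(text)

# Hand-written stand-in for a service response (illustration only).
start = text.index("Berlin")
sample_response = {
    "entities": [
        {
            "rawName": "Berlin",
            "offsetStart": start,
            "offsetEnd": start + len("Berlin"),
            "wikidataId": "Q64",  # Q64 is the Wikidata item for Berlin
        },
        # A mention that was recognized but not linked to Wikidata.
        {"rawName": "2017", "offsetStart": 12, "offsetEnd": 16},
    ]
}

print(json.dumps(query))
print(wikidata_entities(sample_response))
```

In a real client, the query would be sent to a running entity-fishing instance and the parsed JSON reply passed to `wikidata_entities` in place of `sample_response`.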

 
A view of the entity-fishing demo console.

Collaborative notes of the session

Entity recognizer

Grobid-NER - https://github.com/kermitt2/grobid-ner

LMDB

Entity embeddings