Created in 2021, first research structure in Tunisia
Team composed of members from various disciplines and universities
Adapting Wikidata to support clinical practice
Introduction
Items are aligned to external biomedical resources (PubMed, MeSH, etc.)
Wikidata statements supported by references
Biomedical Knowledge in WD
Various types of biomedical items
Multiple languages, mostly European and Asian
Uneven coverage of natural languages for biomedical entities
Parsing WD
Use Wikidata Query Service, Mediawiki API
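A minimal sketch of what querying WDQS from Python looks like, using only the standard library. The endpoint URL is the standard WDQS SPARQL endpoint; the example query (counting items with a MeSH descriptor ID, P486) is an illustrative assumption, not the speakers' exact query, and the request is built but not sent so the sketch stays offline.

```python
# Sketch: building a Wikidata Query Service (WDQS) request in Python.
# The query below (items carrying a MeSH descriptor ID, P486) is an
# invented example, not the speakers' exact query.
from urllib.parse import urlencode
from urllib.request import Request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

query = """
SELECT (COUNT(?item) AS ?n) WHERE {
  ?item wdt:P486 ?meshId .   # P486 = MeSH descriptor ID
}
"""

# Build the HTTP request (not executed here, to keep the sketch offline).
req = Request(
    WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"}),
    headers={"User-Agent": "wd-biomed-notes-example/0.1"},
)
print(req.full_url[:60])
```

Sending `req` with `urllib.request.urlopen` (or the `requests` library) would return the result as JSON.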
Finding insights
Synthesizing data by integrating information
Easily extensible
Everyone can create new items and apply for new properties; easy creation of data models; easy alignment to external resources; intuitive embedding in bots; data models can be changed upon community consent
Biomedical entities dominated by genes and proteins
Many classes of biomedical items poorly supported by references
What Wikidata really needs
A way to allow relation extraction, relation classification…
Can use Scholarly Publications to do this
1.6 million papers issued and indexed in PubMed; also Web of Science, PubMed Central, DataCite
Research publications in brief
Full texts: cannot analyze them at scale (huge size; include natural language, tables, etc.)
But bibliographic data in references is limited size, structured, and annotated by design
PubMed search tags
Can be used to enrich bibliographic metadata in WD despite several legal concerns
Processing data can be used to enrich scientific knowledge in Wikidata
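PubMed supports search field tags such as `[MH]`, which restricts a term to a record's MeSH headings. A minimal sketch of composing such a query from two keywords; the disease/drug pair is an invented example, not taken from the talk:

```python
# Sketch: composing a PubMed query with search field tags.
# [MH] restricts a term to the record's MeSH headings; the
# disease/drug pair below is an invented example.
def mesh_cooccurrence_query(subject: str, obj: str) -> str:
    """Return a PubMed query matching papers tagged with both MeSH terms."""
    return f'"{subject}"[MH] AND "{obj}"[MH]'

print(mesh_cooccurrence_query("Diabetes Mellitus", "Metformin"))
# → "Diabetes Mellitus"[MH] AND "Metformin"[MH]
```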
MeSH Keywords
Controlled keywords assigned to PubMed records by the curators of NCBI databases
Biopython Python library
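Biopython's `Bio.Medline` parser extracts fields like the MeSH headings (`MH` lines) from PubMed records. A stdlib-only sketch of the same idea, run on a hand-made MEDLINE fragment (the PMID and headings are invented for illustration):

```python
# Stdlib-only sketch of pulling MeSH headings out of a MEDLINE record
# (Biopython's Bio.Medline does this parsing for real records).
# The record below is invented for illustration.
record = """\
PMID- 00000000
TI  - An invented example title.
MH  - Diabetes Mellitus/drug therapy
MH  - Metformin/*therapeutic use
MH  - Humans
"""

def mesh_headings(medline_text: str) -> list[str]:
    """Collect MeSH headings (MH lines), stripping subheadings and '*'."""
    headings = []
    for line in medline_text.splitlines():
        if line.startswith("MH  -"):
            term = line[len("MH  -"):].strip()
            headings.append(term.split("/")[0].lstrip("*"))
    return headings

print(mesh_headings(record))
# → ['Diabetes Mellitus', 'Metformin', 'Humans']
```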
Relation classification
Relation classification based on MeSH keywords
Tried to find associations between keywords
Need a dataset of biomedical relations
Concepts assigned labels
Taxonomic relations and non-taxonomic relations
Property constraints
Aligned to MeSH terms
Formulated a SPARQL query accordingly
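A sketch of the kind of SPARQL query described: fetch Wikidata item pairs linked by a given property, where both items carry a MeSH descriptor ID (P486). P2176 ("drug or therapy used for treatment") is used here as an example relation; the speakers' exact query may differ.

```python
# Example SPARQL (as a Python string) for MeSH-aligned relation pairs.
# P2176 is one example relation; the actual query may cover many.
relation_pairs_query = """
SELECT ?subjMesh ?objMesh WHERE {
  ?subj wdt:P2176 ?obj .        # example relation: drug used for treatment
  ?subj wdt:P486 ?subjMesh .    # MeSH descriptor ID of the subject
  ?obj  wdt:P486 ?objMesh .     # MeSH descriptor ID of the object
}
"""
print(relation_pairs_query.strip().splitlines()[0])
```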
Biomedical Relation Classification
Machine Learning Models => Evaluation Metrics
Machine Learning Models
D-Net: Fully Connected or Dense Model
Machine Learning Models
C-Net: Convolutional Neural Networks (CNNs)
Evaluation Metrics
Accuracy
Precision
Recall
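A stdlib-only sketch of the three metrics on toy binary predictions (the labels and predictions are invented):

```python
# Accuracy, precision, and recall on invented toy predictions.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]

tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))   # true positives
tn = sum(t == p == 0 for t, p in zip(y_true, y_pred))   # true negatives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)   # 4/6
precision = tp / (tp + fp)           # 2/3
recall = tp / (tp + fn)              # 2/3
print(accuracy, precision, recall)
```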
MeSH2Matrix Generation
Biomedical Relation Classification
29,000 samples
Good results: accuracies of 78% and 83% for the two models
Data availability
Available via GitHub
Relation Extraction and Validation
Pointwise Mutual Information
A simple measure of association between entities
Used in computational linguistics for finding associations between words
MeSH Keywords are predefined and formatted
Process for relation extraction and validation
Extract and compute PMI between MeSH keywords
Find relation types between MeSH keywords
Formulate query and search in PubMed
Finding relations between keywords
30% as a training set
70% as a test set
Classifying extracted associations
Human validation
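The PMI step above can be sketched with invented counts; in the talk, these would come from PubMed hit counts for each MeSH keyword and keyword pair.

```python
# Sketch of the PMI computation over paper/keyword co-occurrences.
# All counts below are invented for illustration.
from math import log2

N = 1_000_000        # total number of indexed papers (assumed)
n_x = 20_000         # papers tagged with keyword x
n_y = 5_000          # papers tagged with keyword y
n_xy = 1_200         # papers tagged with both

# PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) )
pmi = log2((n_xy / N) / ((n_x / N) * (n_y / N)))
print(round(pmi, 2))
```

A PMI well above zero suggests the two keywords co-occur far more often than chance, hinting at a real association.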
Reference Identification
Process: extract unreferenced WD statements => identify the most relevant PubMed Central publications => find the supporting sentence for each claim => align the PMC ID with the WD ID => add the obtained references to WD
Principles
Find the MeSH equivalents of the subject and object, and of the relation type
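A skeleton of the reference-identification workflow above. Every helper here is a hypothetical stand-in: real implementations would call WDQS, the PubMed Central API, and Wikibase Integrator, and all IDs below are invented placeholders.

```python
# Skeleton of the reference-identification workflow (all helpers and
# IDs are hypothetical placeholders, not the speakers' implementation).
def unreferenced_statements():
    # e.g. a SPARQL query for claims lacking any reference
    return [("Q-subject", "P-relation", "Q-object")]

def candidate_pmc_publications(statement):
    # e.g. a PubMed Central search on the MeSH terms of subject/object
    return ["PMC-candidate"]

def supporting_sentence(pmc_id, statement):
    # e.g. scan the publication text for a sentence linking both entities
    return "placeholder sentence"

new_references = [
    (st, pmc)
    for st in unreferenced_statements()
    for pmc in candidate_pmc_publications(st)
    if supporting_sentence(pmc, st)
]
print(new_references)   # statement/publication pairs to write back to WD
```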
Tools for Bot Creation
Recommend:
Wikibase Integrator: Python library for working with statements in WD
Wikidata Hub
Wikidata Query Service: can submit SPARQL queries from Python
Biopython: Python library for working with bibliographic info, mainly from PubMed and PubMed Central
Q: How did you gather this group of people? A: Conferences. Because people are shy, build consortiums (e.g., a Machine Learning community) so people can join. There is a huge amount of work that can be done by many people; share the work in groups, and then you gain collaborators.
Q: Did you need to address potential "spam" or manipulation of MeSH headings in the medical literature, or did you just rely on reputable domains? A: It depends; we mainly use the probabilistic approach: many papers carrying a given association means it is probably correct. Manipulation is possible at the micro scale but difficult at the macro scale. MeSH headings are assigned by NCBI curators (staff/employees), so they are fairly reliable. We also accounted for retracted and partially retracted literature: PubMed flags retractions, and we use that tag to avoid relying on information that is no longer valid.
Q: Have you used any tool to clean the data pre-ingestion? A: Keyword assignment is a drag-and-drop process: the keyword is not human-generated but chosen from a list. So irrelevant or wrong words, typos, etc. are not a problem; the terms are pre-defined and controlled, with no redundancy or messy data.