Created in 2021, first research structure in Tunisia
Team composed of members from various disciplines and universities
Adapting Wikidata to support clinical practice
Introduction
Items are aligned to external biomedical resources (PubMed, MeSH, etc.)
Wikidata statements supported by references
Biomedical Knowledge in WD
Various types of biomedical items
Multiple languages, mostly European and Asian
Uneven coverage of natural languages for biomedical entities
Parsing WD
Use Wikidata Query Service, Mediawiki API
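A minimal sketch of what querying WDQS from Python looks like, using only the standard library. The endpoint URL is the standard WDQS SPARQL endpoint; the example query (counting items with a MeSH descriptor ID, P486) is an illustrative assumption, not the speakers' exact query, and the request is built but not sent so the sketch stays offline.

```python
# Sketch: building a Wikidata Query Service (WDQS) request in Python.
# The query below (items carrying a MeSH descriptor ID, P486) is an
# invented example, not the speakers' exact query.
from urllib.parse import urlencode
from urllib.request import Request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

query = """
SELECT (COUNT(?item) AS ?n) WHERE {
  ?item wdt:P486 ?meshId .   # P486 = MeSH descriptor ID
}
"""

# Build the HTTP request (not executed here, to keep the sketch offline).
req = Request(
    WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"}),
    headers={"User-Agent": "wd-biomed-notes-example/0.1"},
)
print(req.full_url[:60])
```

Sending `req` with `urllib.request.urlopen` (or the `requests` library) would return the result as JSON.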
Finding insights
Synthesizing data by integrating information
Easily extensible
Everyone can create new items and apply for new properties; easy creation of data models; easy alignment to external resources; intuitive embedding in bots; data models can be changed upon community consent
Biomedical entities dominated by genes and proteins
Many classes of biomedical items poorly supported by references
What Wikidata really needs
A way to allow relation extraction, relation classification…
Can use Scholarly Publications to do this
1.6 million papers issued and indexed in PubMed; also Web of Science, PubMed Central, DataCite
Research publications in brief
Full texts: cannot analyze them at scale (huge size; include natural language, tables, etc.)
But bibliographic data in references is limited size, structured, and annotated by design
PubMed search tags
Can be used to enrich bibliographic metadata in WD despite several legal concerns
Processing data can be used to enrich scientific knowledge in Wikidata
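PubMed supports search field tags such as `[MH]`, which restricts a term to a record's MeSH headings. A minimal sketch of composing such a query from two keywords; the disease/drug pair is an invented example, not taken from the talk:

```python
# Sketch: composing a PubMed query with search field tags.
# [MH] restricts a term to the record's MeSH headings; the
# disease/drug pair below is an invented example.
def mesh_cooccurrence_query(subject: str, obj: str) -> str:
    """Return a PubMed query matching papers tagged with both MeSH terms."""
    return f'"{subject}"[MH] AND "{obj}"[MH]'

print(mesh_cooccurrence_query("Diabetes Mellitus", "Metformin"))
# → "Diabetes Mellitus"[MH] AND "Metformin"[MH]
```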
MeSH Keywords
Controlled keywords assigned to PubMed records by the curators of NCBI databases
Biopython Python library
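Biopython's `Bio.Medline` parser extracts fields like the MeSH headings (`MH` lines) from PubMed records. A stdlib-only sketch of the same idea, run on a hand-made MEDLINE fragment (the PMID and headings are invented for illustration):

```python
# Stdlib-only sketch of pulling MeSH headings out of a MEDLINE record
# (Biopython's Bio.Medline does this parsing for real records).
# The record below is invented for illustration.
record = """\
PMID- 00000000
TI  - An invented example title.
MH  - Diabetes Mellitus/drug therapy
MH  - Metformin/*therapeutic use
MH  - Humans
"""

def mesh_headings(medline_text: str) -> list[str]:
    """Collect MeSH headings (MH lines), stripping subheadings and '*'."""
    headings = []
    for line in medline_text.splitlines():
        if line.startswith("MH  -"):
            term = line[len("MH  -"):].strip()
            headings.append(term.split("/")[0].lstrip("*"))
    return headings

print(mesh_headings(record))
# → ['Diabetes Mellitus', 'Metformin', 'Humans']
```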
Relation classification
Relation classification based on MeSH keywords
Tried to find associations between keywords
Need a dataset of biomedical relations
Concepts assigned labels
Taxonomic relations and non-taxonomic relations
Property constraints
Aligned to MeSH terms
Formulated a SPARQL query accordingly
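A sketch of the kind of SPARQL query described: fetch Wikidata item pairs linked by a given property, where both items carry a MeSH descriptor ID (P486). P2176 ("drug or therapy used for treatment") is used here as an example relation; the speakers' exact query may differ.

```python
# Example SPARQL (as a Python string) for MeSH-aligned relation pairs.
# P2176 is one example relation; the actual query may cover many.
relation_pairs_query = """
SELECT ?subjMesh ?objMesh WHERE {
  ?subj wdt:P2176 ?obj .        # example relation: drug used for treatment
  ?subj wdt:P486 ?subjMesh .    # MeSH descriptor ID of the subject
  ?obj  wdt:P486 ?objMesh .     # MeSH descriptor ID of the object
}
"""
print(relation_pairs_query.strip().splitlines()[0])
```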
Biomedical Relation Classification
Machine Learning Models => Evaluation Metrics
Machine Learning Models
D-Net: Fully Connected or Dense Model
Machine Learning Models
C-Net: Convolutional Neural Networks (CNNs)
Evaluation Metrics
Accuracy
Precision
Recall
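A stdlib-only sketch of the three metrics on toy binary predictions (the labels and predictions are invented):

```python
# Accuracy, precision, and recall on invented toy predictions.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]

tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))   # true positives
tn = sum(t == p == 0 for t, p in zip(y_true, y_pred))   # true negatives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)   # 4/6
precision = tp / (tp + fp)           # 2/3
recall = tp / (tp + fn)              # 2/3
print(accuracy, precision, recall)
```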
MeSH2Matrix Generation
Biomedical Relation Classification
29,000 samples
Good results: accuracies of 78% and 83% for the two models
Data availability
Available via GitHub
Relation Extraction and Validation
Pointwise Mutual Information
A simple measure of association between entities
Used in computational linguistics for finding associations between words
MeSH Keywords are predefined and formatted
Process for relation extraction and validation
Extract and compute PMI between MeSH keywords
Find relation types between MeSH keywords
Formulate query and search in PubMed
Finding relations between keywords
30% as a training set
70% as a test set
Classifying extracted associations
Human validation
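The PMI step above can be sketched with invented counts; in the talk, these would come from PubMed hit counts for each MeSH keyword and keyword pair.

```python
# Sketch of the PMI computation over paper/keyword co-occurrences.
# All counts below are invented for illustration.
from math import log2

N = 1_000_000        # total number of indexed papers (assumed)
n_x = 20_000         # papers tagged with keyword x
n_y = 5_000          # papers tagged with keyword y
n_xy = 1_200         # papers tagged with both

# PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) )
pmi = log2((n_xy / N) / ((n_x / N) * (n_y / N)))
print(round(pmi, 2))
```

A PMI well above zero suggests the two keywords co-occur far more often than chance, hinting at a real association.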
Reference Identification
Process: extract unreferenced WD statements => identify the most relevant PubMed Central publications => find the supporting sentence for each claim => align the PMC ID with the WD ID => add the obtained references to WD
Principles
Find the MeSH equivalents of the subject and object, and of the relation type
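A skeleton of the reference-identification workflow above. Every helper here is a hypothetical stand-in: real implementations would call WDQS, the PubMed Central API, and Wikibase Integrator, and all IDs below are invented placeholders.

```python
# Skeleton of the reference-identification workflow (all helpers and
# IDs are hypothetical placeholders, not the speakers' implementation).
def unreferenced_statements():
    # e.g. a SPARQL query for claims lacking any reference
    return [("Q-subject", "P-relation", "Q-object")]

def candidate_pmc_publications(statement):
    # e.g. a PubMed Central search on the MeSH terms of subject/object
    return ["PMC-candidate"]

def supporting_sentence(pmc_id, statement):
    # e.g. scan the publication text for a sentence linking both entities
    return "placeholder sentence"

new_references = [
    (st, pmc)
    for st in unreferenced_statements()
    for pmc in candidate_pmc_publications(st)
    if supporting_sentence(pmc, st)
]
print(new_references)   # statement/publication pairs to write back to WD
```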
Tools for Bot Creation
Recommend:
Wikibase Integrator: Python library for working with statements in WD
Wikidata Hub
Wikidata Query Service: can submit SPARQL queries from Python
Biopython: Python library for working with bibliographic info, mainly from PubMed and PubMed Central
Q: How did you gather this group of people? A: Conferences. Because people are shy, build consortiums (e.g., a Machine Learning community) so people can join. There is a huge amount of work that can be done by many people; share the work in groups, and then you gain collaborators.
Q: Did you need to address potential "spam" or manipulation of MeSH headings in the medical literature, or did you just rely on reputable domains? A: It depends; we mainly use the probabilistic approach: many papers carrying a given association means it is probably correct. Manipulation is possible at the micro scale but difficult at the macro scale. MeSH headings are assigned by NCBI curators (staff/employees), so they are fairly reliable. We also accounted for retracted and partially retracted literature: PubMed flags retractions, and we use that tag to avoid relying on information that is no longer valid.
Q: Have you used any tool to clean the data pre-ingestion? A: Keyword assignment is a drag-and-drop process: the keyword is not human-generated but chosen from a list. So irrelevant or wrong words, typos, etc. are not a problem; the terms are pre-defined and controlled, with no redundancy or messy data.