Wikidata:Events/Data Quality Days 2021

Data Quality Days
8-15 September 2021


Welcome!
The Data Quality Days are a series of gatherings that took place from September 8th to 15th, 2021, focusing on data quality on Wikidata. With presentations, discussions, editing sprints and more, the goals of this event were:
  • to start some discussions about data quality and highlight this topic through various angles
  • to explore what data quality means in different areas of Wikidata
  • to bring together people who are working on data quality on Wikidata and who want to contribute
  • to highlight and create tools that can be useful when working on data quality
Documentation
The Data Quality Days 2021 have now concluded. Below you can find an overview of the events with their descriptions, slides, notes and, where available, recordings. The outcomes page lists the interesting things that happened during the event. (If you participated, feel free to add yours!)


Events

📆 Day and time (GMT/UTC) ⏰ Duration 💬 Title & short description ℹ️ Type 👥 Facilitator(s) 🌐 Main language ⏯️ Access/replay & 🖋 notes Number of participants
30min Opening session

Data quality: what is it and why is it important?

Presentation + discussion Lydia Pintscher, Manuel Merz, Alessandro Piscopo English
 
opening presentation of the Wikidata Data Quality Days
35
30min Bibliometric-Enhanced Information Retrieval: A new alternative for the validation and enrichment of Wikidata Statements

In this brief presentation, I explain how bibliographic metadata of scholarly publications can be used to verify, validate and enrich specialized knowledge in Wikidata through several practical examples. I will also show RefB, a bot that adds reference support to Biomedical Wikidata Statements based on PubMed Central queries, as a practical example of how Wikidata can be enriched and sustained through leveraging open bibliographic data.

Presentation + discussion Houcemeddine Turki English Notes

Slides

18
60min Structuring the world’s knowledge: Socio-technical processes and data quality in Wikidata.

In his talk, Alessandro presents his research about the socio-technical fabric of Wikidata and how this affects the quality of its data, looking in particular at three aspects: quality of provenance and ontological data; algorithmic (bot) contributions; emerging editor activity patterns.

Presentation + discussion Alessandro Piscopo English
 
Slides.
19
all day Deduplication

Editing session to expand Help:Deduplication, about identifying and removing duplicates. Contribute by adding more information to the help page, expanding its item, or raising open questions on its talk page.

editing session n/a (Jura1) English Help:Deduplication
Help talk:Deduplication
Q108404839
60min ORES: Using AI for quality control in Wikidata

In this session we will explain what ORES is and how it can be used for data quality work.

Presentation + discussion Lydia Pintscher, Ladsgroup English Notes
 
slides
video
22
60min EntitySchemas and Shape Expressions on Wikidata

This session consists of three parts. First, a 15-minute introduction to ShEx will be given, followed by a 30-minute hands-on part in which a new entity schema will be drafted (we take requests). In the final 15 minutes, Eric will discuss possible directions.

Presentation/Hands-on/Discussion Andra Waagmeester, Eric Prud'hommeaux, Katherine Thornton, Jose Emilio Labra Gayo English Notes - Slides 29
all day Ranks

Editing session to review and expand Help:Ranking. Ranking is a key concept of Wikidata to integrate multiple and evolving views about reality. Contribute to expand the help page by adding more information or by commenting on its talk page.

editing session n/a (Jura1) English Help:Ranking
Help talk:Ranking
60min Mismatch finder and beyond: How can we incorporate feedback from our biggest data re-users at scale?

In this session we will take a look at why it is important to get more data quality feedback from outside and how we currently think about it. We'll show how the upcoming Mismatch Finder fits into it and discuss how to go from there.

Presentation + discussion Lydia Pintscher, Manuel Merz English Notes
 
slides
video
9
45min Bringing Czech authority files into 21st century: Integration with Wikidata

Fifteen years have already passed since the first collaboration between (then) Wikipedians and the Czech National Library. Their database of authority files stands at the crossroads of all data related to Czechia: bibliographical, personal, geographical, etc. Over the years, we have learned how to link their entries to Wikidata items, display mutual links, enrich authority files with automatic links to ISNI and ORCID, export MARC files into a CC0-licensed wikibase, and run various events aimed at promoting cross-pollination between the worlds of libraries and Wikidata. Many ideas can likely be replicated worldwide.

30 min Presentation + 15 min Q&A Vojtěch Dostál English Video
all day Dates

Editing session to review and expand Help:Dates. Adding and querying dates may seem simple, but available precision and changing calendars add complexity. Contribute by adding more information to the help page or by formulating open questions on its talk page.

editing session n/a (Jura1) English Help:Dates
Help talk:Dates
60min Wikidata Live Editing

Property Constraints: What to do when you see warnings, how to improve the constraints and how to query using constraints.

Exploration and how-to Abbe98, Ainali English YouTube, Facebook
all day Checks after upload

Editing session to review and expand Help:Checks after upload. Once a dataset has been uploaded to Wikidata, what should be done next? Contribute to the help page by adding checks you have found useful.

editing session n/a (Jura1) English Help:Checks after upload
Help talk:Checks after upload
60min Constraint-a-thon

Setting property constraints. Follows on from the 'Live Editing' session above. This will focus on hands-on work by all attendees (bring your favorite property ID!)

Editing (with brief introduction at the start) Abián, Mike Peel English
 
Slides
45min Periodic editathons as a way to improve data quality: an experiment in Italy

Since March 2021, the Gruppo Wikidata per Musei, Archivi e Biblioteche (GWMAB) has been organizing a series of monthly editathons, aiming to involve Italian-speaking users in improving the data quality of items about authors whose works are held in Italian libraries. This presentation will show how the editathons are organized and the results achieved; ideas and proposals for improvement are welcome in the discussion.

30 min Presentation + 15 min Q&A Epìdosis English
 
Slides

Video

60min Overview of ontology issues

We looked into different types of ontology issues in our data and tried to come up with a classification. We'll present the current state and would love your feedback to understand if it is meaningful and helpful for further work.

Presentation and discussion Silvan Heintze, Lydia Pintscher English Notes
 
slides
video
29
30min Clarifying property application for effective SPARQL queries

Using SPARQL queries to inform and formulate best practices for applying properties and qualifiers in Wikidata.

At the Smithsonian we are trying to model top leadership with the job title “director”: this may be the director of a named museum, of a unit within a museum, or of an independent organization such as the Center for Folklife and Cultural Heritage. Currently in Wikidata, people with similar positions are modeled under “museum director” (Q22132694) and “director” (Q1162163) interchangeably, as part of statements with properties including occupation (P106) and position held (P39), and as qualifiers under the employer (P108) organization. There are multiple ways of representing the same information depending on the editor’s understanding of the definition of various related properties, and this complicates the writing of SPARQL queries. Attempts to compile a list of such directors proved very complex given the wide variety of ways these individuals are recorded in Wikidata. We are seeking to get a better sense of the consensus around how this type of individual’s job title should be described in Wikidata.

Short presentation then discussions Jackie Shieh (ShiehJ), Diane Shaw (Uncommon fritillary), Amy Watson (WatsonAmy), Smithsonian Libraries & Archives English Notes

Slides

22
60min Bug triage hour

We will be looking at tickets related to data quality and maintenance. Bring your tickets for bugs or new features you'd like to discuss.

Discussion Lydia Pintscher, Manuel Merz English Notes 10
60min Cross-checking on-wiki: visibility, duplication, migration

We will talk about increasing data quality by increasing Wikidata's on-wiki visibility (e.g., through infoboxes). We will also discuss the benefits and drawbacks of duplicating data on Wikidata and other Wikimedia projects, and of migrating information to Wikidata, with a focus on Commons category links. This will include hands-on editing of some tricky links between enwiki and Commons via Wikidata.

Presentation/discussion (20 mins)
hands-on (40+ mins)
Mike Peel English Notes
 
Slides
7
60min Discover patrolling & quality tools

Let's talk about tools! You're welcome to present the tools that you use or develop, or join to discover new interesting tools. Presentations/demos should be short, maximum 10min per tool. If you plan to present a tool, feel free to add it here.

Discussion Mohammed Sadat + anyone who wants to show a tool English Notes
 
soweego slides
19
30min Closing discussion

What did we learn during the Data Quality Days? Any outcomes to share? How do we want to move forward from there? What are your ideas and wishes to improve data quality on Wikidata?

Discussion Lydia Pintscher, Manuel Merz English 15
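
The Smithsonian session above notes that directors are modeled in several different ways, which complicates SPARQL queries. As a rough sketch of what a query covering those patterns could look like, the Python snippet below builds a single query that unions the three of them. The helper name `build_director_query` is hypothetical, and the qualifier used on the employer (P108) statement is an assumption (position held, P39), since the session description does not name it.

```python
# Hypothetical helper: builds one SPARQL query that covers the three modeling
# patterns mentioned in the session description. Not an official query.
ROLE_ITEMS = ("Q22132694", "Q1162163")  # "museum director", "director"


def build_director_query(role_items=ROLE_ITEMS, limit=100):
    """Return a SPARQL query string that unions the three modeling patterns."""
    values = " ".join(f"wd:{qid}" for qid in role_items)
    return f"""
SELECT DISTINCT ?person ?role ?pattern WHERE {{
  VALUES ?role {{ {values} }}
  {{
    ?person wdt:P106 ?role .        # modeled as occupation
    BIND("occupation (P106)" AS ?pattern)
  }} UNION {{
    ?person wdt:P39 ?role .         # modeled as position held
    BIND("position held (P39)" AS ?pattern)
  }} UNION {{
    ?person p:P108 ?statement .     # employer statement
    ?statement pq:P39 ?role .       # ASSUMED qualifier: position held
    BIND("qualifier on employer (P108)" AS ?pattern)
  }}
}}
LIMIT {limit}
""".strip()


if __name__ == "__main__":
    print(build_director_query())
```

The printed query can be pasted into the Wikidata Query Service, which predefines the `wd:`, `wdt:`, `p:` and `pq:` prefixes. The `?pattern` column shows which modeling each match uses, which may help when comparing how often each approach occurs.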

Resources

Feel free to add more useful documents, pages or videos here.

  • Wikidata:WikiProject Data Quality
  • Wikidata quality a data consumers' perspective (WikidataCon 2017)
  • Workshop on Data Quality Management in Wikidata (Q59426297)  
  • Data quality panel (WikidataCon 2019)
  • Kartik Shenoy; Filip Ilievski; Daniel Garijo; Daniel Schwabe; Pedro Szekely (30 June 2021), A Study of the Quality of Wikidata, arXiv:2107.00156, Wikidata Q107425133 
  • data quality (Q1757694)  

Bibliometric-Enhanced Information Retrieval

Participants

If you plan to join one or several events, or to work on projects related to data quality during this period, you can sign up here!
Feel free to indicate your username, the languages that you speak, and the topics you're interested in.

  • Lea Lacroix (WMDE) - fr, en, de, it - I'd like to understand more about how the Wikidata community maintains and improves data quality
  • Epìdosis - it, en - as of now I'm involved in some projects for improving the connection to Wikidata of authority files from libraries in Italy and other parts of Europe
  • PKM - en - I'm interested in using tools to improve network connections among Wikidata items and between Wikidata and Commons, which makes it easier to identify data problems
  • so9q - en,sv - I'm interested in trying out mismatch finder and find mismatches with tools written in Python 😃
  • GoranSM - en,sr - Data Scientist for Wikidata. I would like to contribute any data products that might help us understand data quality in Wikidata!
  • Mike Peel - en (and a little es, pt) - I've added a possible session to the etherpad.
  • Oravrattas - en - interested in improving how we keep political data up-to-date
  • Bouzinac - fr, en - committed to keeping transport data up-to-date (airports, subways...) and to general consistency (finding/merging duplicates...)
  • Csisc - aeb, ar, fr, en, it - I am interested in improving the data validation of Wikidata through several novel methods.
  • Jura1 - en - TBD
  • ShiehJ - en - interested in best practices for applying properties and constraints, data quality assurance workflow, the potential of Wikidata:Schemas for more granular user account control
  • LAP959 - en - interested in ensuring stable, harmonious and robust data that are inclusive, easy to use and visualise
  • Akbarali - ml, ar, en - I would like to know more about how the Wikidata files can be used in academics.
  • Luckyz - en, it - TBD
  • Vladimir_Alexiev - en,bg,ru - I've made 3.5M edits and I'm very frustrated with WD's update performance. There can be no quality without the ability to easily make edits! https://phabricator.wikimedia.org/T290061
  • Kpjas - en,pl - I believe that referenced data is the crux of what Wikidata represents.
  • Lydia Pintscher (WMDE) - en, de - I'd love to discuss all things data quality, show what we already have and especially understand what's still missing.
  • Justin0x2004 - en - I'm interested in increasing modeling/representation uniformity.
  • MisterSynergy - en, de - countervandalism, patrolling, ORES; ideas on how to discuss, as well as comprehensively document and visualize, preferable data models for a given problem/field
  • Azertus - en, nl, fr - data quality, property constraints and documentation, meta documentation in general
  • Score Beethoven - en, nl, fr - data quality, data bulk input, output, scripts for manipulation, etc
  • Loz.ross - en, de, bg - interested in the use of Entity Schemas and related tools in maintaining data quality in Wikibase instances in general (as well as Wikidata); also: ontology standards; bulk input; synchronization across knowledge graph repos, etc
  • Jmkeil - de, en - interested in RDF data quality, especially RDF dataset comparison for quality evaluation; I want to learn about the mismatch finder tool to get an idea of how it could be integrated with my (in development) RDF dataset comparison tool
  • Ambrosia10 - en - I'm likely to just concentrate on actual editing, using tools such as the author disambiguator, mixnmatch and citeunseen gadgets to improve the quality of existing items. I'm interested in any other tools or gadgets I can use to improve the quality of existing Wikidata items.
  • Dnshitobu - en, dag - interested in learning about data quality, bulk data modeling and how to organize government institutional data to allow for free but accessible information
  • Sradovsk - en - Interested in learning, period, through watching presentations, participating in hands-on activities.
  • Jelabra - es, en - I am interested in the use of ShEx and Entity Schemas in Wikidata
  • 99of9 - en - I want more identifier properties, which are all natural references and quality improvers. I very much use and value constraints, and want more of them. I want to be able to write a custom SPARQL query that will be run regularly at a set interval with the result tabulated in wikitext. I happily contribute and appreciate the quality contributions of others.
  • Daniel Mietchen - ru, fr, de, en - interested in (i) internal consistency (logically and across languages or knowledge domains), (ii) consistency with external resources, (iii) workflows to propagate curation events between Wikidata and external resources, (iv) tools, workflows and documentation of any of that
  • Mccoyle55 - en - TBD
  • Sbae2020 - en - TBD
  • Fuzheado - en - data quality on objects, specifically prints, photographs, status and reproductions, where we are not doing so well
  • Antoine2711 - fr, en - TBD
  • Memathieu - fr, en, es - TBD
  • Girassolei - pt, en
  • GiFontenelle - pt-br, pt, en, es - GLAM-related sessions, especially with bibliographic data.
  • Dbigwood - en - tools to use to ensure better bibliographic data
  • Hjfocs - it, fr, es, en - feedback loops with data providers and re-users
  • Aisha Khatun - en, bn - interested in understanding how ORES is used in Wikidata
  • Manuel Merz (WMDE) - en, de - I would like to get to know other people working on data quality and better understand what WMDE can do to help.
  • Oronsay - en - I am merging/adding P1889, adding gender to humans, doing mixnmatch and using the Bargioni add more identifiers script
  • Hsarrazin - fr, en - working on biographic data and portraits, authorities, book editions (cross-wiki with Wikisource).