The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
This RFC discusses how the data import process could be improved. I am closing this discussion as the state of affairs has changed significantly since then, with the revamping of the Wikidata:Dataset Imports portal and the release of various data import tools mentioned in the Wikidata:Data Import Guide. – Pintoch (talk) 17:50, 7 November 2018 (UTC)
An editor has requested the community to provide input on "Mapping and improving the data import process" via the Requests for comment (RFC) process. This is the discussion page regarding the issue.
If you have an opinion regarding this issue, feel free to comment below. Thank you!
What would the ideal data import process look like? This page maps the current data import processes including tools and documentation, what is available and what is still needed to create an ideal process. Once this information has been mapped it can later be broken down into Phabricator tasks to keep easy track of progress.
We want to get as many community members as possible involved in these early stages of planning. The end goal is a well structured task list showing the improvements needed to make the data import process as good as possible, which we can then turn into a set of Phabricator tasks.
This discussion follows on from the session held at WikidataCon 2017, which aimed to gather feedback from the community about problems/difficulties with the current data import flow.
The goals of the import process are outlined in each section; the overall goals of having a centralised data import process are:
Increase the volume and quality of data in Wikidata through:
Lowering the barrier to contributing to Wikidata for new contributors and experts through better documentation (e.g. more Wikidata tours).
Tools to make it much easier to import data into Wikidata.
Providing resources for contributors to increase their skill level.
Providing a high quality service to data partners.
Understand topic completeness on Wikidata
Collating catalogues of datasets on different subjects into a central record. Not all topics have WikiProjects, many WikiProjects overlap in topic, and collating all data involves many languages, data types and contributors.
Collating data imports into this same central record.
Creating a Wikidata item for each database, with the number of items in the database as a statement, allowing users to compare this against the number of items on Wikidata.
Increase trust by organisations and individuals reusing Wikidata including other Wikimedia projects through:
Providing clear information of how data is added to Wikidata and where it comes from.
Providing clear information about Wikidata's shortcomings and the development roadmap for new tools and documentation to improve data quality.
Having a much higher percentage of referenced statements; this could be done by making it policy to reference statements wherever possible when importing datasets.
Being transparent about the data import process with documentation of decisions made when importing a dataset.
Linking between import documentation and items and vice versa.
More linking to other Wikimedia projects, e.g. Commons categories.
Other Wikimedia contributors are more easily able to contribute to Wikidata and gain value from doing so.
Increase the number of organisations and individuals contributing to Wikidata through:
Providing information on the purpose and potential of contributing and using Wikidata.
Providing metrics on data reuse.
Increase the number of organisations and individuals reusing data from Wikidata through:
More complete and more topic-specific documentation and easier tools to reuse data, including using Wikidata as authority control.
An easier to understand and more attractive way to explore data on Wikidata.
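The topic-completeness goal above (recording each database's size as a statement, then comparing it with the number of matching items on Wikidata) could start as simply as a count query per external-ID property. This is only a sketch: the property ID and the external count of 5,000 entries are illustrative assumptions, not part of the original proposal.

```python
# Sketch: compare the number of Wikidata items carrying a given property
# against the item count recorded for the external database.
# The counts used in the example are made up for illustration.

def count_query(prop_id):
    """Build a SPARQL query counting distinct items with the given property."""
    return (
        "SELECT (COUNT(DISTINCT ?item) AS ?count) "
        "WHERE { ?item wdt:%s ?value . }" % prop_id
    )

def coverage(wikidata_count, external_count):
    """Fraction of the external database already represented on Wikidata."""
    return wikidata_count / external_count if external_count else 0.0

# P1435 is the real "heritage designation" property; the 4,000-of-5,000
# match rate is a hypothetical example.
query = count_query("P1435")
print(query)
print("coverage: %.0f%%" % (100 * coverage(4000, 5000)))
```

The generated query can be run at query.wikidata.org and the result stored next to the database's own item count, giving a simple progress figure per dataset.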
Some resources for data partners exist but are limited; there are no step-by-step instructions for publishing open data or getting the data imported into Wikidata. The laws around copyright and other data rights are complicated and not well understood by many Wikidata contributors and data partners alike.
Data producers have a good understanding of the purpose and potential of contributing to and using Wikidata. Many organisations use Wikidata as an authority control to make their data Linked Open Data. Information is available for different audiences interested in Wikidata, e.g. data science students, librarians, museums, galleries and archives. There is an easy to understand and attractive way to explore data on Wikidata.
There is very little coordination of data imported on a specific topic and few records of what has and has not been imported on a subject. It's difficult for an external organisation to find out whether their data is within the scope of Wikidata.
A place to map all the data available on a subject and systematically import it into Wikidata, to know what is available and what is missing, including across multiple external databases. For example, to have a worldwide list of built heritage on Wikidata (very useful for Wiki Loves Monuments), all national and regional heritage registers would need to be imported.
More consistent use of the Data Import Hub, so we have a searchable record for identifying whether someone is already working on importing a dataset (and which Wikidata users / organisations are involved).
A WikiProject or task force that can be the first contact point for external organisations who need to find out which parts (if any) of their dataset are notable enough for Wikidata. Ideally we also need a non-wiki alternative, such as a mailing list address or even a person who can be contacted by phone/Skype.
A first version of a central record of all datasets imported exists (the Data Import Hub), but it isn't used. The Data Import Hub has a notes section, a place to plan the structure of data imports and a place for users to collaborate on data imports.
The current system should be OK for a while, but will become very difficult to manage as the number of entries listed increases. It's also not easy for someone without knowledge of wikitext to start and edit a Data Import Hub entry.
Property proposals often take a long time to be accepted or rejected.
A central recording mechanism for all external datasets imported into Wikidata. It is easy to start, manage and provide topic knowledge on a data import, so we can cooperate and capture the knowledge of external data partners and other subject experts. There is an easy way to extract databases from external sites if they do not offer spreadsheet downloads.
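Extracting a database from a site with no spreadsheet download often amounts to scraping an HTML table. A minimal sketch using only the Python standard library is below; the table markup fed to it is a made-up example, and a real register would need per-site tweaks.

```python
# Sketch: pull rows out of an HTML table on an external site that offers
# no spreadsheet download. Standard library only; the sample markup is
# invented for illustration.
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows = []        # completed rows
        self._row = []        # cells of the row being parsed
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

sample = ("<table><tr><th>Name</th><th>Year</th></tr>"
          "<tr><td>Old Mill</td><td>1780</td></tr></table>")
parser = TableExtractor()
parser.feed(sample)
print(parser.rows)  # rows ready to be written out as CSV for an import
```

From here the rows can be written to CSV with the `csv` module and fed into whichever import tool the dataset's hub entry settles on.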
A series of incomplete resources to help people learn how to import data into Wikidata. Tools require high technical ability and have limited documentation. Bot requests often take a long time to be assessed and done.
Significantly lowered technical barriers to uploading data. A semi-automated way of keeping existing imports synchronised with the external database. Any manual work needed is minimised where possible, clearly described and easy to contribute to.
Documentation on the manual work needed for each dataset, which is easy to find and feeds into a central list of certain kinds of tasks.
Develop a staging area where people can test their imports before ‘going live’ on Wikidata. This would ideally be a complete mirror of Wikidata, but with the additional data being tested for import showing up in the interface. Once you are happy it all looks good, you can click “Publish” to go ahead and update Wikidata for real.
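The semi-automated synchronisation mentioned above could start as a plain diff between the previously imported snapshot of the external database and a fresh export, so only changed records need human review. The record IDs and field names below are illustrative, not from any real register.

```python
# Sketch of the synchronisation idea: diff two snapshots of an external
# database so that only added, removed or changed records need review.
# All IDs and values here are invented for illustration.

def diff_import(previous, current):
    """Return (added, removed, changed) record IDs between two snapshots.

    Both arguments map an external record ID to a dict of its values.
    """
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    changed = sorted(k for k in set(previous) & set(current)
                     if previous[k] != current[k])
    return added, removed, changed

old = {"REG-1": {"name": "Old Mill"}, "REG-2": {"name": "Town Hall"}}
new = {"REG-2": {"name": "Town Hall (rebuilt)"}, "REG-3": {"name": "Pier"}}
print(diff_import(old, new))  # (['REG-3'], ['REG-1'], ['REG-2'])
```

A staging area as described above could then show only the diff output for approval before anything is published to Wikidata for real.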
There is no guidance on maintaining data quality; it is left to each user to write SPARQL queries to check the data, with no list of possible queries. Recently Magnus wrote the Recent Changes tool to show changes in the results of a SPARQL query. Knowing how to create and run SPARQL queries is required to check data quality.
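A shared list of quality queries could start with checks like the one sketched below: finding items whose statement for a given property carries no reference, exactly the kind of query each importer currently has to write from scratch. The choice of property is an example only.

```python
# Sketch: one reusable quality check - items whose statement for a given
# property has no reference. The property ID passed in is an example.

WDQS = "https://query.wikidata.org/sparql"  # Wikidata Query Service endpoint

def unreferenced_query(prop_id, limit=100):
    """Build a SPARQL query listing items with an unreferenced statement.

    References attach to statement nodes via prov:wasDerivedFrom in the
    Wikidata RDF model, so their absence marks an unreferenced statement.
    """
    return (
        "SELECT ?item WHERE { "
        "?item p:%(p)s ?stmt . "
        "FILTER NOT EXISTS { ?stmt prov:wasDerivedFrom ?ref . } "
        "} LIMIT %(n)d" % {"p": prop_id, "n": limit}
    )

q = unreferenced_query("P571")  # P571 = "inception"
print(q)
```

The resulting query string can be pasted into query.wikidata.org, or a saved list of such generated queries could be run periodically against the endpoint above, lowering the barrier for importers who do not know SPARQL.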
Data quality is maintained and improved over time using a set of tools with a low technical barrier to use. Users can track changes in data that has been imported, to understand what has changed and fix errors introduced. Errors are less likely to be introduced in the first place. Data is integrated with information from other Wikimedia projects, e.g. attaching a Commons category to the item. It is easy to find and repair vandalism. There is effective dispute resolution for disagreements about items. There are easy-to-use processes to assess and improve item quality.
There is a lack of understanding, trust and feeling of control and agency over Wikidata on some other Wikimedia projects, which is preventing integration, especially on English Wikipedia. Some resources are available, but many are missing or incomplete. There are incomplete instructions on how to get started contributing to Wikidata.
Trust among contributors to other Wikimedia projects is much higher. Other Wikimedia contributors contribute to Wikidata more easily and more frequently. Other Wikimedia projects use Wikidata widely and gain value from doing so.
Have consistency in the way we mark when Wikipedia gets used as a reference for a statement, so that Wikipedias can decide against using that data in their infoboxes (Wikipedia does not like Wikipedia as a source), possibly partially resolved through this
Find some way to deal with data about people which is acceptable to other Wikimedia projects, possibly partially resolved by Wikidata:Living_persons_(draft).
Tools, Documentation, Community Communication
Resolve issues with Wikipedia watch lists showing changes to Wikidata data that is used on a page.
Show the value of Wikidata-fed visualisations on different language Wikipedias.
Tools, Documentation, Community Communication
Provide clear instructions, with examples, on how to construct Wikidata-fed infoboxes for other Wikimedia projects.
Help other Wikimedia projects use the imported data to find gaps in coverage of a subject, missing or incorrect categorisation, missing or incorrect uses of templates, etc., e.g. using Listeria lists such as the List of Royal Academicians.
To be able to generate a report on data added and where it is used across Wikimedia projects. This will be especially useful for partner organisations who want to understand how widely their data is used, including in languages they usually do not reach.
In the long term we should have a single tool that allows you to generate all available metrics, but the initial goal should be to gather together all of the existing resources and add them to the data partnership pages. A new "metrics" chapter should be added so that we have a place to put these resources, and some instructions on how to use them.
What facts come from a certain source
Where they are used across Wikimedia projects
How many people see those pages on Wikimedia projects
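For the last of those metrics, how many people see the pages, the existing Wikimedia Pageviews REST API already covers per-article view counts. A small sketch of assembling a request URL for it is below; the article and date range are placeholders.

```python
# Sketch: build a request URL for the Wikimedia Pageviews REST API,
# one building block for the metrics report described above.
# The article and date range are placeholder examples.

def pageviews_url(project, article, start, end):
    """Per-article daily pageviews, all access methods, human users only."""
    return (
        "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
        "%s/all-access/user/%s/daily/%s/%s" % (project, article, start, end)
    )

url = pageviews_url("en.wikipedia", "Royal_Academy_of_Arts",
                    "20180101", "20180131")
print(url)
```

Fetching that URL with any HTTP client returns JSON view counts per day, which a metrics page could sum across every article that uses a given dataset's statements.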
The above discussion is preserved as an archive. Please do not modify it. Subsequent comments should be made in a new section.