The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
This RFC discusses how the data import process could be improved. I am closing this discussion as the state of affairs has changed significantly since then, with the revamping of the Wikidata:Dataset Imports portal and the release of various data import tools mentioned in the Wikidata:Data Import Guide. – Pintoch (talk) 17:50, 7 November 2018 (UTC)
An editor has requested the community to provide input on "Mapping and improving the data import process" via the Requests for comment (RFC) process. This is the discussion page regarding the issue.
If you have an opinion regarding this issue, feel free to comment below. Thank you!
What would the ideal data import process look like? This page maps the current data import processes including tools and documentation, what is available and what is still needed to create an ideal process. Once this information has been mapped it can later be broken down into Phabricator tasks to keep easy track of progress.
We want to get as many community members as possible involved in these early stages of planning. The end goal is a well structured task list showing the improvements needed to make the data import process as good as possible, which we can then turn into a set of Phabricator tasks.
This discussion follows on from the session held at WikidataCon 2017, which aimed to gather feedback from the community about problems/difficulties with the current data import flow.
The goals of the import process are outlined in each section; the overall goals of having a centralised data import process are:
Increase the volume and quality of data in Wikidata through:
Lowering the barrier to contributing to Wikidata for new contributors and experts through better documentation (e.g. more Wikidata tours).
Tools to make it much easier to import data into Wikidata.
Providing resources for contributors to increase their skill level.
Providing a high quality service to data partners.
Understand topic completeness on Wikidata
Collating catalogues of datasets on different subjects into a central record. Not all topics have WikiProjects, many WikiProjects overlap in topic, and collating all data involves many languages, data types and contributors.
Collating data imports into this same central record.
Creating a Wikidata item for each database, with the number of items in the database as a statement, allowing users to compare this against the number of items on Wikidata.
Increase trust by organisations and individuals reusing Wikidata including other Wikimedia projects through:
Providing clear information of how data is added to Wikidata and where it comes from.
Providing clear information about Wikidata's shortcomings and the development roadmap for new tools and documentation to improve data quality.
Having a much higher percentage of referenced statements; this could be done by making it policy to reference statements wherever possible when importing datasets.
Being transparent about the data import process with documentation of decisions made when importing a dataset.
Linking between import documentation and items and vice versa.
More linking to other Wikimedia projects, e.g. Commons categories.
Other Wikimedia contributors are more easily able to contribute to Wikidata and gain value from doing so.
Increase the number of organisations and individuals contributing to Wikidata through:
Providing information on the purpose and potential of contributing and using Wikidata.
Providing metrics on data reuse.
Increase the number of organisations and individuals reusing data from Wikidata through:
More complete and more topic-specific documentation and easier tools to reuse data, including using Wikidata as authority control.
An easier to understand and more attractive way to explore data on Wikidata.
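The topic-completeness goal above (recording each database's size as a statement, then comparing it with the number of matching items on Wikidata) could start as simply as a count query per external-ID property. This is only a sketch: the property ID and the external count of 5,000 entries are illustrative assumptions, not part of the original proposal.

```python
# Sketch: compare the number of Wikidata items carrying a given property
# against the item count recorded for the external database.
# The counts used in the example are made up for illustration.

def count_query(prop_id):
    """Build a SPARQL query counting distinct items with the given property."""
    return (
        "SELECT (COUNT(DISTINCT ?item) AS ?count) "
        "WHERE { ?item wdt:%s ?value . }" % prop_id
    )

def coverage(wikidata_count, external_count):
    """Fraction of the external database already represented on Wikidata."""
    return wikidata_count / external_count if external_count else 0.0

# P1435 is the real "heritage designation" property; the 4,000-of-5,000
# match rate is a hypothetical example.
query = count_query("P1435")
print(query)
print("coverage: %.0f%%" % (100 * coverage(4000, 5000)))
```

The generated query can be run at query.wikidata.org and the result stored next to the database's own item count, giving a simple progress figure per dataset.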
Some resources for data partners exist but are limited; there are no step-by-step instructions for publishing open data or getting the data imported into Wikidata. The laws around copyright and other data rights are complicated and not well understood by many Wikidata contributors and data partners alike.
Data producers have a good understanding of the purpose and potential of contributing to and using Wikidata. Many organisations use Wikidata as an authority control to make their data Linked Open Data. Information is available for different audiences interested in Wikidata, e.g. data science students, librarians, museums, galleries and archives. There is an easy to understand and attractive way to explore data on Wikidata.
There is very little coordination of data imported on a specific topic and few records of what has and has not been imported on a subject. It's difficult for an external organisation to find out whether their data is within the scope of Wikidata.
A place to map all the data available on a subject and systematically import it into Wikidata, to know what is available and what is missing, including across multiple external databases. For example, to have a worldwide list of built heritage on Wikidata (very useful for Wiki Loves Monuments), all national and regional heritage registers would need to be imported.
More consistent use of the Data Import Hub, so we have a searchable record for identifying whether someone is already working on importing a dataset (and which Wikidata users / organisations are involved).
A WikiProject or task force that can be the first contact point for external organisations who need to find out which parts (if any) of their dataset are notable enough for Wikidata. Ideally we also need a non-wiki alternative, such as a mailing list address or even a person who can be contacted by phone/Skype.
A first version of a central record of all datasets imported exists (the Data Import Hub), but it isn't used. The Data Import Hub has a notes section, a place to plan the structure of data imports and a place for users to collaborate on data imports.
The current system should be OK for a while, but will become very difficult to manage as the number of entries listed increases. It's also not easy for someone without knowledge of wikitext to start and edit a Data Import Hub entry.
Property proposals often take a long time to be accepted or rejected.
A central recording mechanism for all external datasets imported into Wikidata. It is easy to start, manage and provide topic knowledge on a data import, so we can cooperate and capture the knowledge of external data partners and other subject experts. There is an easy way to extract databases from external sites if they do not offer spreadsheet downloads.
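Extracting a database from a site with no spreadsheet download often amounts to scraping an HTML table. A minimal sketch using only the Python standard library is below; the table markup fed to it is a made-up example, and a real register would need per-site tweaks.

```python
# Sketch: pull rows out of an HTML table on an external site that offers
# no spreadsheet download. Standard library only; the sample markup is
# invented for illustration.
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows = []        # completed rows
        self._row = []        # cells of the row being parsed
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

sample = ("<table><tr><th>Name</th><th>Year</th></tr>"
          "<tr><td>Old Mill</td><td>1780</td></tr></table>")
parser = TableExtractor()
parser.feed(sample)
print(parser.rows)  # rows ready to be written out as CSV for an import
```

From here the rows can be written to CSV with the `csv` module and fed into whichever import tool the dataset's hub entry settles on.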
A series of incomplete resources to help people learn how to import data into Wikidata. Tools require high technical ability and have limited documentation. Bot requests often take a long time to be assessed and done.
Significantly lowered technical barriers to uploading data. A semi-automated way of keeping existing imports synchronised with the external database. Any manual work needed is minimised where possible, clearly described and easy to contribute to.
Documentation on the manual work needed for each dataset, which is easy to find and feeds into a central list of certain kinds of tasks.
Develop a staging area where people can test their imports before ‘going live’ on Wikidata. This would ideally be a complete mirror of Wikidata, but with the additional data being tested for import showing up in the interface. Once you are happy it all looks good, you can click “Publish” to go ahead and update Wikidata for real.
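The semi-automated synchronisation mentioned above could start as a plain diff between the previously imported snapshot of the external database and a fresh export, so only changed records need human review. The record IDs and field names below are illustrative, not from any real register.

```python
# Sketch of the synchronisation idea: diff two snapshots of an external
# database so that only added, removed or changed records need review.
# All IDs and values here are invented for illustration.

def diff_import(previous, current):
    """Return (added, removed, changed) record IDs between two snapshots.

    Both arguments map an external record ID to a dict of its values.
    """
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    changed = sorted(k for k in set(previous) & set(current)
                     if previous[k] != current[k])
    return added, removed, changed

old = {"REG-1": {"name": "Old Mill"}, "REG-2": {"name": "Town Hall"}}
new = {"REG-2": {"name": "Town Hall (rebuilt)"}, "REG-3": {"name": "Pier"}}
print(diff_import(old, new))  # (['REG-3'], ['REG-1'], ['REG-2'])
```

A staging area as described above could then show only the diff output for approval before anything is published to Wikidata for real.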
There is no guidance on maintaining data quality; it is left to each user to write SPARQL queries to check the data, with no list of possible queries. Recently Magnus wrote the Recent Changes tool to show changes in the results of a SPARQL query. Knowing how to create and run SPARQL queries is required to check data quality.
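A shared list of quality queries could start with checks like the one sketched below: finding items whose statement for a given property carries no reference, exactly the kind of query each importer currently has to write from scratch. The choice of property is an example only.

```python
# Sketch: one reusable quality check - items whose statement for a given
# property has no reference. The property ID passed in is an example.

WDQS = "https://query.wikidata.org/sparql"  # Wikidata Query Service endpoint

def unreferenced_query(prop_id, limit=100):
    """Build a SPARQL query listing items with an unreferenced statement.

    References attach to statement nodes via prov:wasDerivedFrom in the
    Wikidata RDF model, so their absence marks an unreferenced statement.
    """
    return (
        "SELECT ?item WHERE { "
        "?item p:%(p)s ?stmt . "
        "FILTER NOT EXISTS { ?stmt prov:wasDerivedFrom ?ref . } "
        "} LIMIT %(n)d" % {"p": prop_id, "n": limit}
    )

q = unreferenced_query("P571")  # P571 = "inception"
print(q)
```

The resulting query string can be pasted into query.wikidata.org, or a saved list of such generated queries could be run periodically against the endpoint above, lowering the barrier for importers who do not know SPARQL.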
Data quality is maintained and improved over time using a set of tools with a low technical barrier to use. Users can track changes in data that has been imported, to understand what has changed and fix errors introduced. Errors are less likely to be introduced in the first place. Data is integrated with information from other Wikimedia projects, e.g. attaching a Commons category to the item. It is easy to find and repair vandalism. There is effective dispute resolution for disagreements about items. There are easy-to-use processes to assess and improve item quality.
There is a lack of understanding, trust and feeling of control and agency over Wikidata on some other Wikimedia projects, which is preventing integration, especially on English Wikipedia. Some resources are available, but many are missing or incomplete. There are incomplete instructions on how to get started contributing to Wikidata.
Trust among contributors to other Wikimedia projects is much higher. Other Wikimedia contributors contribute to Wikidata more easily and more frequently. Other Wikimedia projects use Wikidata widely and gain value from doing so.
Have consistency in the way we mark when Wikipedia gets used as a reference for a statement, so that Wikipedias can decide against using that data in their infoboxes (Wikipedia does not like Wikipedia as a source), possibly partially resolved through this
Find some way to deal with data about people which is acceptable to other Wikimedia projects, possibly partially resolved by Wikidata:Living_persons_(draft).
Tools, Documentation, Community Communication
Resolve issues with Wikipedia watch lists showing changes to Wikidata data that is used on a page.
Show the value of Wikidata-fed visualisations on different language Wikipedias.
Tools, Documentation, Community Communication
Provide clear instructions, with examples, on how to construct Wikidata-fed infoboxes for other Wikimedia projects.
Help other Wikimedia projects use the imported data to find gaps in coverage of a subject, missing or incorrect categorisation, missing or incorrect uses of templates, etc., e.g. using Listeria lists such as the List of Royal Academicians.
To be able to generate a report on data added and where it is used across Wikimedia projects. This will be especially useful for partner organisations who want to understand how widely their data is used, including in languages they usually do not reach.
In the long term we should have a single tool that allows you to generate all available metrics, but the initial goal should be to gather together all of the existing resources and add them to the data partnership pages. A new "metrics" chapter should be added so that we have a place to put these resources, and some instructions on how to use them.
What facts come from a certain source
Where they are used across Wikimedia projects
How many people see those pages on Wikimedia projects
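For the last of those metrics, how many people see the pages, the existing Wikimedia Pageviews REST API already covers per-article view counts. A small sketch of assembling a request URL for it is below; the article and date range are placeholders.

```python
# Sketch: build a request URL for the Wikimedia Pageviews REST API,
# one building block for the metrics report described above.
# The article and date range are placeholder examples.

def pageviews_url(project, article, start, end):
    """Per-article daily pageviews, all access methods, human users only."""
    return (
        "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
        "%s/all-access/user/%s/daily/%s/%s" % (project, article, start, end)
    )

url = pageviews_url("en.wikipedia", "Royal_Academy_of_Arts",
                    "20180101", "20180131")
print(url)
```

Fetching that URL with any HTTP client returns JSON view counts per day, which a metrics page could sum across every article that uses a given dataset's statements.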
The above discussion is preserved as an archive. Please do not modify it. Subsequent comments should be made in a new section.