Wikidata:Requests for comment/Mapping and improving the data import process
- The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
This RFC discusses how the data import process could be improved. I am closing this discussion as the state of affairs has changed significantly since it was opened, with the revamping of the Wikidata:Dataset Imports portal and the release of various data import tools mentioned in the Wikidata:Data Import Guide. – Pintoch (talk) 17:50, 7 November 2018 (UTC)
An editor has requested that the community provide input on "Mapping and improving the data import process" via the Requests for comment (RFC) process. This is the discussion page regarding the issue.
If you have an opinion regarding this issue, feel free to comment below. Thank you!
What would the ideal data import process look like? This page maps the current data import process, including tools and documentation: what is available and what is still needed to create an ideal process. Once this information has been mapped, it can later be broken down into Phabricator tasks so that progress is easy to track.
We want to get as many community members as possible involved in these early stages of planning. The end goal is a well-structured task list of the improvements needed to make the data import process as good as possible, which we can then turn into a set of Phabricator tasks. This discussion follows on from the session held at WikidataCon 2017, which aimed to gather feedback from the community about problems and difficulties with the current data import flow.
== Goals ==
The goals of the import process are outlined in each section below; the overall goals of having a centralised data import process are:
== Phabricator task structure ==
A suggested structure for the Phabricator task list is a single root project called Data partnerships process with three main subprojects.
== Import process overview ==
The process of importing external datasets into Wikidata can be broken down into the following steps; tools and documentation could combine two or more of these steps:
== 1. Provide information about Wikidata to data partners ==
=== Current situation ===
Some resources for data partners exist but are limited; there are no step-by-step instructions for publishing open data or for getting the data imported into Wikidata. The laws around copyright and other data rights are complicated and not well understood by many Wikidata contributors or data partners.
=== Existing resources ===
=== Goals ===
Data producers have a good understanding of the purpose and potential of contributing to and using Wikidata. Many organisations use Wikidata as an authority control to make their data Linked Open Data. Information is available for the different audiences interested in Wikidata, e.g. data science students, librarians, museums, galleries and archives. There is an easier to understand and more attractive way to explore data on Wikidata.
=== Resources needed ===
== 2. Identify data to be imported ==
=== Current situation ===
There is very little coordination of data imports on a specific topic, and few records of what has and has not been imported on a subject. It is difficult for an external organisation to find out whether their data is within the scope of Wikidata.
=== Existing resources ===
=== Goals ===
A place to map all the data available on a subject and systematically import it into Wikidata, so that we know what is available and what is missing across multiple external databases. For example, to have a worldwide list of built heritage on Wikidata (very useful for Wiki Loves Monuments), all national and regional heritage registers would need to be imported. More consistent use of the Data Import Hub, so there is a searchable record showing whether someone is already working on importing a dataset (and which Wikidata users / organisations are involved).
=== Resources needed ===
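As a sketch of the kind of coverage-mapping resource this step could use, assuming Python with the requests library and the public Wikidata Query Service, and using P1435 (heritage designation) with P17 (country) as one illustrative way of modelling built heritage:

<syntaxhighlight lang="python">
# Sketch: count items with a heritage designation per country, to see where
# coverage looks thin compared to the national registers we know exist.
# Assumes: requests is installed; P1435 = heritage designation, P17 = country.
# On large classes of items the query may need narrowing to avoid timeouts.
import requests

ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """
SELECT ?country ?countryLabel (COUNT(?item) AS ?count) WHERE {
  ?item wdt:P1435 ?designation ;
        wdt:P17 ?country .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
GROUP BY ?country ?countryLabel
ORDER BY DESC(?count)
"""

def heritage_counts():
    response = requests.get(
        ENDPOINT,
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "data-import-coverage-sketch/0.1 (example)"},
    )
    response.raise_for_status()
    for row in response.json()["results"]["bindings"]:
        yield row["countryLabel"]["value"], int(row["count"]["value"])

if __name__ == "__main__":
    for country, count in heritage_counts():
        print(f"{country}\t{count}")
</syntaxhighlight>

Countries with suspiciously low counts would be candidates whose national or regional registers have probably not been imported yet.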
== 3. Plan and record data imports ==
=== Current situation ===
A first version of a central record of all imported datasets exists (the Data Import Hub), but it isn't used. The Data Import Hub has a notes section, a place to plan the structure of data imports, and a place for users to collaborate on imports. The current system should be adequate for a while, but it will become very difficult to manage as the number of entries increases. It is also not easy for someone without knowledge of wikitext to start and edit a Data Import Hub entry. Property proposals often take a long time to be accepted or rejected.
=== Existing resources ===
=== Goals ===
A central recording mechanism for all external datasets imported into Wikidata. It is easy to start, manage and provide topic knowledge on a data import, so that we can cooperate with, and capture the knowledge of, external data partners and other subject experts. There is an easy way to extract databases from external sites if they do not offer spreadsheet downloads.
=== Resources needed ===
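A minimal sketch of the kind of extraction helper mentioned under Goals, assuming Python with pandas (plus lxml for HTML parsing); the register URL and the "largest table" heuristic are purely illustrative:

<syntaxhighlight lang="python">
# Sketch: pull an HTML table from an external register that offers no
# CSV/Excel download and save it as a spreadsheet for cleaning and matching.
# Assumes: pandas and lxml are installed; the URL below is hypothetical.
import pandas as pd

REGISTER_URL = "https://example.org/heritage/register"  # hypothetical source page

def extract_register(url: str = REGISTER_URL, out_path: str = "register.csv") -> pd.DataFrame:
    tables = pd.read_html(url)          # returns every <table> found on the page
    register = max(tables, key=len)     # crude heuristic: take the largest table
    register.to_csv(out_path, index=False)
    return register

if __name__ == "__main__":
    df = extract_register()
    print(df.head())
</syntaxhighlight>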
== 4. Matching data to existing data on Wikidata ==
=== Current situation ===
Tools exist to match external data to existing Wikidata items, but automatic matching is sometimes wrong.
=== Existing resources ===
=== Goals ===
Matching of data to existing Wikidata items is faster and easier. If an external database uses other external identifiers, these can be used to match the data, e.g. ISO codes.
=== Resources needed ===
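A minimal sketch of identifier-based matching as described under Goals, assuming Python with requests and using P297 (ISO 3166-1 alpha-2 code) as the shared identifier; the sample rows are illustrative:

<syntaxhighlight lang="python">
# Sketch: match rows of an external dataset to Wikidata items via a shared
# external identifier (here ISO 3166-1 alpha-2 codes, property P297).
# Assumes: requests is installed; the input rows are illustrative only.
import requests

ENDPOINT = "https://query.wikidata.org/sparql"

def items_by_iso_code(codes):
    """Return a mapping from ISO 3166-1 alpha-2 code to Wikidata item URI."""
    values = " ".join(f'"{code}"' for code in codes)
    query = f"""
    SELECT ?code ?item WHERE {{
      VALUES ?code {{ {values} }}
      ?item wdt:P297 ?code .
    }}
    """
    response = requests.get(
        ENDPOINT,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "data-import-matching-sketch/0.1 (example)"},
    )
    response.raise_for_status()
    return {
        row["code"]["value"]: row["item"]["value"]
        for row in response.json()["results"]["bindings"]
    }

if __name__ == "__main__":
    external_rows = [{"name": "France", "iso": "FR"}, {"name": "Ghana", "iso": "GH"}]
    matches = items_by_iso_code([row["iso"] for row in external_rows])
    for row in external_rows:
        print(row["name"], "->", matches.get(row["iso"], "no match found"))
</syntaxhighlight>

Rows with no match would then be sent to a manual matching step rather than having new items created blindly.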
== 5. Data importing ==
=== Current situation ===
A series of incomplete resources exists to help people learn how to import data into Wikidata. The tools require a high level of technical ability and have limited documentation. Bot requests often take a long time to be assessed and carried out.
=== Existing resources ===
=== Goals ===
Significantly lowered technical barriers to uploading data. A semi-automated way of keeping existing imports synchronised with the external database. Any manual work needed is minimised where possible, clearly described and easy to contribute to.
=== Resources needed ===
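To illustrate the current technical barrier, a minimal sketch of adding a single statement with pywikibot, one of the existing bot frameworks; it assumes a configured pywikibot login and edits only the Wikidata sandbox item (Q4115189) as an example:

<syntaxhighlight lang="python">
# Sketch: add one "instance of" (P31) statement with pywikibot. A real import
# would loop over a cleaned dataset, check for existing claims, add references,
# and (for large batches) go through bot approval first.
# Assumes: pywikibot is installed and configured with a logged-in user.
import pywikibot

site = pywikibot.Site("wikidata", "wikidata")
repo = site.data_repository()

def add_instance_of(item_id: str, class_id: str, summary: str) -> None:
    item = pywikibot.ItemPage(repo, item_id)
    item.get()  # load existing claims so duplicates could be checked first
    claim = pywikibot.Claim(repo, "P31")
    claim.setTarget(pywikibot.ItemPage(repo, class_id))
    item.addClaim(claim, summary=summary)

if __name__ == "__main__":
    # Q4115189 = Wikidata sandbox item; Q5 (human) is used purely as an example value.
    add_instance_of("Q4115189", "Q5", "data import sketch (example edit)")
</syntaxhighlight>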
== 6. Maintain and improve data quality ==
=== Current situation ===
There is no guidance on maintaining data quality; it is left to each user to invent SPARQL queries to check the data, with no shared list of possible queries. Magnus recently wrote a recent-changes tool to show changes in the results of a SPARQL query. Knowing how to create and run SPARQL queries is required to check data quality.
=== Existing resources ===
=== Goals ===
Data quality is maintained and improved over time using a set of tools with a low technical barrier to use. Users can track changes in data that has been imported, to understand what has changed and to fix any errors introduced. Errors are less likely to be introduced in the first place. Data is integrated with information from other Wikimedia projects, e.g. attaching a Commons category to the item. It is easy to find and repair vandalism. There is effective dispute resolution for disagreements about items. There are easy-to-use processes to assess and improve item quality.
=== Resources needed ===
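A minimal sketch of the kind of reusable quality query described above, assuming Python with requests against the public query service; the specific check (date of death before date of birth) is just one example constraint, and on large classes the query may need narrowing to avoid timeouts:

<syntaxhighlight lang="python">
# Sketch: a reusable data-quality check run against the Wikidata Query Service.
# This example flags items whose date of death (P570) precedes their date of
# birth (P569); a shared list of such queries could be run after each import.
import requests

ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """
SELECT ?item ?birth ?death WHERE {
  ?item wdt:P569 ?birth ;
        wdt:P570 ?death .
  FILTER(?death < ?birth)
}
LIMIT 50
"""

def suspicious_items():
    response = requests.get(
        ENDPOINT,
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "data-quality-sketch/0.1 (example)"},
    )
    response.raise_for_status()
    for row in response.json()["results"]["bindings"]:
        yield row["item"]["value"], row["birth"]["value"], row["death"]["value"]

if __name__ == "__main__":
    for item, birth, death in suspicious_items():
        print(item, birth, death)
</syntaxhighlight>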
== 7. Usage of data on other Wikimedia projects ==
=== Current situation ===
There is a lack of understanding of, trust in, and sense of control and agency over Wikidata on some other Wikimedia projects, which is preventing integration, especially on English Wikipedia. Some resources are available but many are missing or incomplete. The instructions on how to get started contributing to Wikidata are incomplete.
=== Existing resources ===
=== Goals ===
Trust in Wikidata among contributors to other Wikimedia projects is much higher. Contributors to other Wikimedia projects can more easily and more frequently contribute to Wikidata. Other Wikimedia projects use Wikidata widely and gain value from doing so.
=== Resources needed ===
== 8. Reporting on data usage ==
=== Current situation ===
There is currently no documentation on how to do this, and it is unclear whether additional tools are needed.
=== Existing resources ===
=== Goals ===
To be able to generate a report on the data added and where it is used across Wikimedia projects. This will be especially useful for partner organisations who want to understand how widely their data is used, including in languages they do not usually reach.
=== Resources needed ===
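It is not yet clear what tooling this step needs, but as a rough starting sketch, assuming Python with requests: sitelink counts for items carrying a partner's identifier give a crude proxy for reach across Wikimedia projects. P214 (VIAF ID) and the sample size below are illustrative only.

<syntaxhighlight lang="python">
# Sketch: a very rough "reach" report for a partner dataset. For a sample of
# items that carry the partner's identifier (here P214, VIAF ID, as an example)
# it sums sitelink counts, i.e. how many Wikimedia project pages link to them.
import requests

ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """
SELECT ?item ?links WHERE {
  ?item wdt:P214 ?id ;
        wikibase:sitelinks ?links .
}
LIMIT 1000
"""

def usage_report():
    response = requests.get(
        ENDPOINT,
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "data-usage-report-sketch/0.1 (example)"},
    )
    response.raise_for_status()
    rows = response.json()["results"]["bindings"]
    total_links = sum(int(row["links"]["value"]) for row in rows)
    print(f"Sampled items: {len(rows)}")
    print(f"Total sitelinks across the sample: {total_links}")

if __name__ == "__main__":
    usage_report()
</syntaxhighlight>

A fuller report would also break usage down by project and language, which this sketch does not attempt.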