Wikidata:WikidataCon 2017/Notes/Data import: An overview of the current system, and idea exchange for the future direction

Title: Data import: An overview of the current system, and idea exchange for the future direction.

Speaker(s)

Name or username
  • Navino Evans
  • Stuart Prior

Contact (email, Twitter, etc.)
  • Navino Evans: email navino@histropedia.com, Twitter @NavinoEvans
  • Stuart Prior: Twitter @StuartPrior2
Useful links

Abstract

This session is a workshop intended to bring together Wikidata wizards and enthusiasts who can help resolve some of the major pain points in the data import process. For example:

  • It’s very difficult to keep Wikidata ‘in sync’ with an external data source (especially when they have no unique ID system!).
  • There’s no way to easily report metrics related to a particular data import.
  • It’s hard to find which data has been changed since a previous import.
  • Lots of data processing is needed by people with advanced spreadsheet and/or coding skills.

It will be framed around a presentation section, where we will share our experience of importing data from UNESCO, and show the data import documentation that has been created as a result.

Tool development will obviously play a huge part in solving the problems with data import. Central to our discussion will be where existing tools fit into the process (e.g. QuickStatements, Mix’n’Match, the Wikidata Query Service), and how things may need to change when GLAMPipe is ready to become the hub for data imports.

The hope is that we can come away from this session with a plan that will help shape future decisions about the data import process, documentation and related tools.

Collaborative notes of the session

Key questions posed in the session for the discussion and feedback session that followed:

  1. How to keep Wikidata in sync with an external data source.
  2. It’s hard to find which data has been changed since a previous import.
  3. Lots of data processing is needed by people with advanced spreadsheet and/or coding skills.
  4. There’s no way to easily report metrics related to a particular data import.

Questions / Answers

  • Point: Data could be split into different types based on whether it will 'expire' or not, i.e. some data we do not expect to change any time soon, such as the height of Mount Everest, but some data will need replacing.
    • Speaker: An interesting idea to explore. Note that on Wikidata we will normally not actually replace the data, but add a new statement with a qualifier to show the date it is true (such as the population of a city); see the sketch after this list.
  • Point: Having some sort of "badge" to show users a level of completeness/trustworthiness of the data.
    • Speaker: This overlaps with the current work on "signed statements", which is at an early stage of development. It will allow us to put a visible badge on a statement to say it has come from a particular approved source (e.g. a UNESCO badge for statements about World Heritage data).
  • Question: Is there a centralised list of all these issues, organised and structured so that people can collaborate and work through fixing them all (as you would on a big IT project)?
    • Speaker: That is what we really need, but discussions are very decentralised at the moment. Feedback and bug reporting on the tools generally goes to Magnus (the main import tool volunteer developer), and other data import discussions have been scattered.
    • Afterthought 11/11/2017: We could create a central root Phabricator task with lots of subtasks for the entire data import process, so that we can centralise discussions.
  • Point: The issue is not technical ability, as there are typically people with the necessary skills in the organisation wanting to import. For them the issue is "how do you know if you can import the data?", i.e. will the community accept this large import. In other words, needing more guidance is the issue, not technical ability. Also, some sort of staging area would really help with this.
    • Speaker: Agree that technical skills are not the issue for big-scale data imports, but having lots more editors with spreadsheet and QuickStatements skills is really needed, as a lot of the jobs only need this level of skill. It is difficult to get community consensus because of the small participation in the discussions, but you always need to start in project chat and look for feedback before doing any work.
    • Afterthought 11/11/2017: I should also have mentioned that the data import hub and guide are the attempt at providing that guidance. I also feel we need a paid expert or two who can be contacted in a traditional non-wiki way (phone, email, etc.). Also, a staging area is a great idea!
  • Question: Where do you see the difficulty in keeping synchronised? Surely you can just use the date of retrieval to find what to update as new data is released?
    • Speaker: You can in some cases, but often the external data set will not have unique identifiers, and changes are made on Wikidata, making it difficult to reconcile a new list with the existing spreadsheet, especially if the structure of the data on Wikidata has changed in the meantime (from community discussion/editing). The data source can vary as well: it may be a downloadable .csv file, scraped from a web page, etc. It is all very possible, but you need to write a bot, and the list of bot requests is long. The GLAM Pipe tool shows a lot of promise that it can help allow non-coders to do the processing required. See the reconciliation sketch after this list.
  • Point: There are already aggregation projects that deal with these problems on a day-to-day basis, e.g. Europeana. These sorts of projects have people who would be very happy to join in the discussion and share their expertise.
    • Speaker: A great point! We will reach out to them with these questions.
  • Point: (Jens) Question 4 (no easy way to report metrics) is not as much of an issue as you might think, as there are already lots of APIs for page views, connected Wikipedia pages, etc. What would be very useful is if the organisations giving data could report what kind of metrics they need, so that the community and tool developers know which metrics to include.
    • Speaker: Sounds like this point may be solved to some extent then. The one major issue that prompted this question is "How can we get a list of all infoboxes (in any language Wikipedia) using this data set?". We were aware that once we know this we can easily find the articles using it and then generate page view statistics over any time period, but we could not find an automatic way to determine the list of infoboxes. Are you aware of any way this can be done yet?
    • Response: I will look into it and get back to you.
    • Speaker: Third-party usage metrics can't be found automatically, but we should at least all be listing the uses we find in a central place, so it's easy to find out where data we've imported is likely to be used.
  • Point: Sharing the experience of first edits to Wikidata being a mass import of Austrian politicians into Wikidata (using "Wikidata Integrator"). There was a big issue with not overwriting/updating when the statements were already there (which creates duplicate statements). If there had been the staging area mentioned, it would have enabled him to find the issue and save himself fixing it all by hand. It would be amazing to have a web interface for choosing what to do with various data sources, whether to overwrite or duplicate, etc.
    • Speaker: GLAMPipe promises to be able to do this, but it's still in development, so it may be a while before it works as expected. Also note that QuickStatements 2 (https://tools.wmflabs.org/quickstatements/) has a big upgrade over version 1 in this area: it now detects duplicate statements and is able to combine them where necessary.
  • Point: (Stuart Prior) What about the question "should we upload the data?" (we've spoken about "can we do it"). How do you know if you should, as a third party? It is not easy for an external person to engage with the community on the wiki. Understanding how the community works is a real barrier to data import, and this should be better documented.
  • Point: We should consider a custom Wikimedia spreadsheet system, similar to Google Sheets but without the limitations of closed data etc. It could have all of the data import functions built in.
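
The "add a new statement with a date qualifier" pattern mentioned above can be prepared in bulk. Below is a minimal sketch (an illustration, not something shown in the session) that turns a hypothetical CSV of city populations into QuickStatements-style commands using P1082 (population) qualified by P585 (point in time); the exact command syntax should be verified against the QuickStatements documentation before running a real import.

```python
# Minimal sketch: generate QuickStatements-style commands that ADD a new
# population statement with a "point in time" qualifier, rather than
# replacing the old value. The CSV layout (qid, population, year) is a
# hypothetical example.
import csv

def population_commands(csv_path):
    """Yield tab-separated command lines: item, P1082 (population), value,
    P585 (point in time), date. Check the QuickStatements docs before use."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Year precision (/9) because only the census year is known.
            date = f"+{row['year']}-00-00T00:00:00Z/9"
            yield "\t".join([row["qid"], "P1082", row["population"], "P585", date])

if __name__ == "__main__":
    for line in population_commands("city_population.csv"):
        print(line)
```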
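
On the reconciliation problem discussed above (external data sets with no unique identifiers), a common first pass is a label search against Wikidata that keeps only unambiguous matches and flags everything else for human review. The sketch below is a rough illustration of that idea using the wbsearchentities API; the input names and the "exactly one hit counts as a match" rule are simplifying assumptions, not a recommended workflow.

```python
# Rough sketch of label-based reconciliation against Wikidata when the
# external data set has no unique identifiers.
import requests

API = "https://www.wikidata.org/w/api.php"

def candidate_items(name, language="en", limit=5):
    """Return candidate Wikidata items for a label via wbsearchentities."""
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": language,
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return requests.get(API, params=params).json().get("search", [])

def reconcile(names):
    """Split external names into confident matches and ones needing review."""
    matched, needs_review = {}, []
    for name in names:
        hits = candidate_items(name)
        if len(hits) == 1:
            matched[name] = hits[0]["id"]
        else:
            needs_review.append(name)  # zero or several candidates
    return matched, needs_review

if __name__ == "__main__":
    matched, review = reconcile(["Mount Everest", "Palace of Westminster"])
    print(matched)
    print(review)
```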

Overview of the session

Speaker

  • The current process is too difficult for most editors to help with.
  • We need to remove some of the barriers to entry to encourage more data donation, and provide better ways of reporting metrics to demonstrate the impact of the donation.
  • We need a short term set of solutions to the key problems, and a long term solution to work towards.

Audience

  • There are many experts in different organisations who can help, we need to reach out to them.
  • A major issue is knowing whether you can import some data (i.e. it's hard to know how Wikidata and the community work as an outsider).
  • Many expressed a desire for a "staging area" where you could test and explore an import before it goes live on Wikidata.

Summary from the discussion session that followed, covering the key questions posed in this session

What can we do now?

We should develop the data import process in a centralised way, with the people using it giving feedback and taking extra care to document their experience, sharing the problems and solutions found. Lessons learned need to be reviewed and distilled back into the documentation.

We can put together good resources for learning the relatively basic spreadsheet skills needed for data importing. A wide range of metrics can already be reported, but we need to find out exactly which metrics data partner organisations need, and all existing metrics resources should be rounded up and added to the data import guide.
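
As a concrete example of a metric that can already be reported, the sketch below sums monthly page views for a set of Wikipedia articles using the Wikimedia pageviews REST API. The article titles and date range are placeholders; building the article list itself (e.g. from sitelinks or infobox usage) is the part that still needs the workflow discussed in the session.

```python
# Hedged sketch: monthly page views for articles that use imported data,
# via the Wikimedia pageviews REST API. Titles and dates are placeholders.
import requests

PAGEVIEWS = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/"
             "per-article/{project}/all-access/user/{article}/monthly/{start}/{end}")
HEADERS = {"User-Agent": "data-import-metrics-sketch/0.1 (example only)"}

def monthly_views(article, project="en.wikipedia.org",
                  start="20170101", end="20171101"):
    """Sum monthly page views for one article over a date range (YYYYMMDD)."""
    url = PAGEVIEWS.format(project=project, article=article, start=start, end=end)
    items = requests.get(url, headers=HEADERS).json().get("items", [])
    return sum(item["views"] for item in items)

if __name__ == "__main__":
    for title in ["Mount_Everest", "UNESCO"]:
        print(title, monthly_views(title))
```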

A tool for showing recent changes to a data set defined by a query is greatly needed (and has since been provided by Magnus Manske: https://tools.wmflabs.org/wikidata-todo/sparql_rc.php). This is a huge help for keeping data in sync and for metrics reporting. The new tool should be added to the relevant section of the data import guide. The other tool that has been highlighted is the gadget for rolling back a series of edits (https://meta.wikimedia.org/wiki/User:Hoo_man/Scripts/Smart_rollback). This is a vital tool to know about when importing medium or large amounts of data, so it should be highlighted in the data import guide.
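
To show what a query-defined data set looks like in practice, here is a hedged sketch that pulls the items of an assumed "World Heritage Sites" set from the Wikidata Query Service; the same SPARQL query is the kind of input a recent-changes tool such as sparql_rc works from. P1435 (heritage designation) and Q9259 (World Heritage Site) are assumptions to be adapted to the actual import.

```python
# Sketch: fetch the items of a query-defined data set from the Wikidata
# Query Service. The property/item IDs are assumptions for illustration.
import requests

ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P1435 wd:Q9259 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

def dataset_items():
    """Yield (item URI, label) pairs for the data set."""
    resp = requests.get(ENDPOINT, params={"query": QUERY, "format": "json"},
                        headers={"User-Agent": "data-import-sync-sketch/0.1"})
    for row in resp.json()["results"]["bindings"]:
        yield row["item"]["value"], row["itemLabel"]["value"]

if __name__ == "__main__":
    for uri, label in dataset_items():
        print(uri, label)
```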

What should we work towards in the future?

Documentation needs to be completely user-centric, and broken down into different tiers by skill level. There should be a range of domain-specific guides, which are both easy to find and easy to understand (at the skill level they are pitched at). All documentation and guides should be developed with a constant feedback mechanism, learning from the pain/mistakes of the past to improve the model.

The data import hub should be developed into a separate tool so it is easy for people to interact with it without needing wiki text skills. All metrics reporting, synchronisation tasks, etc. should be listed in this central place for each data import.

Create a "one click" metrics reporting button available straight from the data import hub, giving all the data we can automatically find on the exposure of the data set. Develop a staging area where people can test their imports before 'going live' on Wikidata. This would ideally be a complete mirror of Wikidata, but with the 'tester user' having the data they are mock-importing show up in their own interface (maybe new items, or edits to existing items). Once you are happy it all looks good, you can click "Publish" to go ahead and update Wikidata.