Wikidata:WikidataCon 2017/Notes/Data donations discussion

Title: Data donation discussion

Note-taker(s): MB-one

Speaker(s)

Name or username: Navino Evans

Contact (email, Twitter, etc.): navino histropedia.com twitter: @NavinoEvans

Useful links: Notes from previous session - more related links from there

Collaborative notes of the session

KEY QUESTIONS POSED

How to keep Wikidata in sync with an external data source

It’s hard to find which data has been changed since a previous import.

Lots of data processing is needed by people with advanced spreadsheet and/or coding skills.

There’s no way to easily report metrics related to a particular data import.

Each group chooses a question and comes up with some ideas for:

What can we do now?

What should we work towards in the future?

Discussion about Topic 1: How to keep Wikidata in sync with an external data source?

experience/known problems

using translation services for labels and uploading the results to Wikidata

during the translation process, items could have changed

→ some loss of data/information results from that

annotating Wikidata items with data from an external database

external identifiers aren’t stable & Wikidata identifiers aren’t stable

not clear if merging/deletion is appropriate

errors in Wikidata propagate back to the external database

mirrors went out of sync

splitting an item into two requires a lengthy manual process of splitting the links to that item

“wrongfully” matched items (matched due to too few statements on certain items) have to be unmatched

1) creation of a CSV for QuickStatements, 2) creation of a CSV for Mix'n'match, 3) creation of a Wikidata game to reconcile external entries to Wikidata items

ideas:

“lock” certain statements based on authority of the data source

get a feed of changes of certain properties (“Watch list for properties”)

better filters for watchlists → feed back to external db

checking diffs of wikidata items since last import → feedback (see the sketch after this list)

use a bot to feed changes of external db to wikidata

challenge: most scenarios don’t allow for bot editing

set up own SPARQL endpoints for the external data source
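
As an illustration of the "checking diffs since last import" idea above, here is a minimal Python sketch (assuming the requests library and a known date for the last import) that pulls an item's revision history from the standard MediaWiki API. It is only one possible approach, not an agreed tool.

 import requests

 API = "https://www.wikidata.org/w/api.php"

 def revisions_since(qid, since_iso):
     """List revisions of one Wikidata item made after since_iso (e.g. the date of the last import)."""
     params = {
         "action": "query",
         "prop": "revisions",
         "titles": qid,
         "rvprop": "ids|timestamp|user|comment",
         "rvlimit": "max",
         "rvend": since_iso,  # oldest timestamp to include; the API lists newest first
         "format": "json",
     }
     page = next(iter(requests.get(API, params=params).json()["query"]["pages"].values()))
     return page.get("revisions", [])

 # Example: every change to Q42 since a hypothetical import on 1 January 2017
 for rev in revisions_since("Q42", "2017-01-01T00:00:00Z"):
     print(rev["timestamp"], rev["user"], rev["comment"])

Running this per imported item gives a raw feed of changes that can then be compared against the external database.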

What can we do now?

keep track of changes per item in wikidata

What are we wishing for?

Queries for recent changes to wd (a query sketch follows this list) [UPDATE 12/11/2017 - New tool has just been released https://tools.wmflabs.org/wikidata-todo/sparql_rc.php]

Spreadsheet interface

Sandbox for ingesting the data before it goes live on Wikidata

mass revert of own edits in batch

there is the gadget Smart Rollback: https://meta.wikimedia.org/wiki/User:Hoo_man/Scripts/Smart_rollback
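
Related to the wish for "queries for recent changes" above: the Wikidata Query Service already records a last-modified date for every item (schema:dateModified), so a rough version of such a query can be sketched today. The property P1566 (GeoNames ID) and the cutoff date below are only example stand-ins for whichever external-ID property and import date a project actually uses.

 import requests

 ENDPOINT = "https://query.wikidata.org/sparql"

 # Items carrying a given external-ID property that were modified after a cutoff date.
 QUERY = """
 SELECT ?item ?id ?modified WHERE {
   ?item wdt:P1566 ?id ;
         schema:dateModified ?modified .
   FILTER(?modified > "2017-10-01T00:00:00Z"^^xsd:dateTime)
 }
 LIMIT 100
 """

 resp = requests.get(ENDPOINT, params={"query": QUERY, "format": "json"},
                     headers={"User-Agent": "data-sync-sketch"})
 for row in resp.json()["results"]["bindings"]:
     print(row["item"]["value"], row["modified"]["value"])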

Discussion about Topic 3
Lots of data processing is needed by people with advanced spreadsheet and/or coding skills
  • experience/known problems
  • users confused, don't know how to proceed, which steps to take in which order, documentation is too general to provide sufficient guidance
  • knowledge about how successful projects prepared and uploaded data in a specific domain is not necessarily documented and available so that each new project must reinvent the wheel (even though it already exists)
  • Excel skills and skills in data matching, which do not necessarily involve coding, seem to be especially important (a small example of such a matching step is sketched after this list)
  • tool-centric approach as opposed to a user-needs-centric approach
  • skill levels of users are very different
  • datasets and processes differ by domain, so a one-size-fits-all approach is problematic
  • ideas:
  • systematic approach to gathering user feedback and observing users to see where exactly they are blocked, by domain and skill level, in order to address these first
  • encourage projects to document the steps they took (including non-technical steps), the people they involved and the methods they used, and make that information available in one easy-to-find place
  • adapt documentation to different user skill levels
  • ensure that crucial Excel skills (which anyone can learn) are identified and links to learning material provided (why not links to existing internet tutorials for the specific skills?)
  • offer specific spreadsheet training
  • documentation of domain-specific property matching (process)
  • centralize documentation in one place
  • use cases and workflows should drive development
  • solutions will be domain specific
  • train network of data uploaders (inspired by Wikipedia model)
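
To make the data-matching step above concrete, here is a minimal sketch of what one reconciliation pass could look like in code, using the public wbsearchentities API. The input file people.csv and its "name" column are hypothetical, and real projects would still review the candidate matches by hand (for example in Mix'n'match or a Wikidata game).

 import csv
 import requests

 API = "https://www.wikidata.org/w/api.php"

 def candidates(label, language="en"):
     """Return possible Wikidata items whose label or alias matches the given text."""
     params = {
         "action": "wbsearchentities",
         "search": label,
         "language": language,
         "type": "item",
         "limit": 5,
         "format": "json",
     }
     return requests.get(API, params=params).json().get("search", [])

 # people.csv with a "name" column is a hypothetical example input.
 with open("people.csv", newline="", encoding="utf-8") as f:
     for row in csv.DictReader(f):
         found = candidates(row["name"])
         print(row["name"], "->", found[0]["id"] if found else "no match")
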
What can we do now?

most of the ideas can be put into place

What are we wishing for?

user-centered, detailed instructions by domain, accompanied by training or easy access to training material, centrally located so it is easy to find; a user feedback mechanism

Discussion about Topic 4
How to report metrics for data imports
  • experience/known problems

It's difficult to find the stats on where data has been used in infoboxes; once this is known, page view stats are very easy

It's hard to find out how much of the dataset has changed since the import (i.e. what percentage of it still perfectly matches the source), and which parts have changed (so you can investigate and reconcile between Wikidata and the source)

What can we do now?

Find out which metrics are needed by external organisations

Set up a simple way to present the key metrics (a relatively simple dev task using existing APIs; see the sketch after this list)

Tool for showing recent changes to a data set [UPDATE 12/11/2017 - New tool has just been released https://tools.wmflabs.org/wikidata-todo/sparql_rc.php]

Organise all of the existing resources and add them to the relevant part of the data import guide
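
As a sketch of the "relatively simple dev task using existing APIs" mentioned above: the Wikimedia Pageviews REST API can already report article views, so a first version could look roughly like this (the article name and date range are placeholders; a real report would loop over every page that uses the imported data).

 import requests

 # Wikimedia Pageviews REST API, daily views for one article.
 PAGEVIEWS = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/"
              "per-article/{project}/all-access/user/{article}/daily/{start}/{end}")

 def daily_views(article, project="en.wikipedia", start="20171001", end="20171031"):
     url = PAGEVIEWS.format(project=project, article=article, start=start, end=end)
     resp = requests.get(url, headers={"User-Agent": "data-import-metrics-sketch"})
     return [(item["timestamp"], item["views"]) for item in resp.json().get("items", [])]

 views = daily_views("Douglas_Adams")
 print(sum(v for _, v in views), "views in October 2017")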

What are we wishing for?

A report button from the data import hub that allows you to see a detailed breakdown of all the relevant stats we can gather

Examples would include page views on Wikipedia by language, the number of reference URLs linking to the data source website (and the number of those showing in Wikipedia), etc.
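
One of the example metrics above, the number of reference URLs pointing at the data source's website, can already be approximated with a SPARQL query against the Wikidata Query Service. A hedged sketch follows; the domain example.org is a placeholder, and an unrestricted scan like this may time out, so in practice it would be limited to the items of a single import.

 import requests

 ENDPOINT = "https://query.wikidata.org/sparql"

 # Count statements whose reference URL (P854) points at the partner's domain.
 QUERY = """
 SELECT (COUNT(DISTINCT ?statement) AS ?refs) WHERE {
   ?statement prov:wasDerivedFrom ?refNode .
   ?refNode pr:P854 ?url .
   FILTER(CONTAINS(STR(?url), "example.org"))
 }
 """

 resp = requests.get(ENDPOINT, params={"query": QUERY, "format": "json"},
                     headers={"User-Agent": "data-import-metrics-sketch"})
 print(resp.json()["results"]["bindings"][0]["refs"]["value"], "referenced statements")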

Overview of the session

What we can do now

We should develop the data import process in a centralised way, with people using it giving feedback and taking extra care to document their experience to share the problems and solutions found. Lessons learned need to be reviewed and distilled back into the documentation.

A simple thing we can do now is put together good resources for learning the spreadsheet skills needed in data importing (which are relatively simple).

A wide range of metrics can already be reported, but we need to find out exactly what data partner organisations need, and all existing metrics resources should be rounded up and added to the data import guide.

A tool for showing recent changes to a data set from a query is greatly needed (and has since been provided by Magnus Manske: https://tools.wmflabs.org/wikidata-todo/sparql_rc.php). This is a huge help for keeping data in sync and metrics reporting. The new tool should be added to the relevant section of the data import guide. The other tool that's been highlighted is the gadget for rolling back a series of edits (https://meta.wikimedia.org/wiki/User:Hoo_man/Scripts/Smart_rollback). This is a vital tool to know about when importing medium or large amounts of data, so should be highlighted in the data import guide.

What we can do in the future

Documentation needs to be completely user centric, and broken down into different tiers by skill level. There should be a range of guides that are domain specific, which are both easy to find and easy to understand (at the skill level they are pitched to). All documentation and guides should be developed with a constant feedback mechanism, learning from the pain/mistakes of the past to improve the model.

The data import hub should be developed into a tool so it's easy to interact with, even with zero wikitext skills. All metrics reporting, synchronisation tasks, etc. should be listed in this central place for each data import. Create a "one click" metrics reporting button available straight from the data import hub, giving all the data we can automatically find on the exposure of the data set.

Develop a staging area where people can test their imports before 'going live' on Wikidata. This would ideally be a complete mirror of Wikidata, but with the 'tester user' having the data they are mock-importing show up in their own interface (maybe new items, or edits to existing items). Once you are happy it all looks good, you can click "Publish" to go ahead and update Wikidata.