Wikidata:WikiProject Wikidata for research/Meetups/2015-02-23-Berlin/Notes


Contributions to the "Wikidata for research" project (including Wikidata:WikiProject Wikidata for research and all its pages) are dual licensed under CC BY-SA 3.0 (the Wikimedia default) and the Creative Commons Attribution 4.0 license.
Contributions by the project to the item and project namespaces of Wikidata shall be under CC0.

Notes

''If possible, please use wiki markup, so that the notes can be easily transferred to Wikidata afterwards.''

This doc is for notes taken during https://www.wikidata.org/wiki/Wikidata:WikiProject_Wikidata_for_research/Meetups/2015-02-23-Berlin .

Participants

Please sign up at https://www.wikidata.org/wiki/Wikidata:WikiProject_Wikidata_for_research/Meetups/2015-02-23-Berlin#Participants .

Introductions

  • Gregor Hagedorn, Museum für Naturkunde Berlin, Science Programme “Digital World and Information Science” (Welcome, Anthropocene)
  • Daniel Mietchen, Museum für Naturkunde Berlin, Science Programme “Digital World and Information Science”
  • Jeff T, University of California, Los Angeles
  • David Fichtmüller, Botanic Garden and Botanical Museum Berlin-Dahlem, Biodiversity Informatics Research Group
  • Anton Güntsch, Botanic Garden and Botanical Museum Berlin-Dahlem, Biodiversity Informatics Research Group
  • Falko Glöckler, Museum für Naturkunde Berlin, Science Programme “Digital World and Information Science”
  • Benjamin Karran, FU Berlin, Ebekebe
  • Magnus Knuth, HPI, Ph.D., also DBpedia
  • Harald Sack, HPI, Semantic Technologies, DBpedia, Knowledge Engineering
  • Christoph Bruch, Helmholtz Open Science Collaboration, Policy department
  • Succu, Wikidata Community
  • Jan Brase, previously leading DataCite, now: German National Library of Science and Technology (TIB)
  • Markus Krötzsch, leader of a research group in Dresden
  • Lydia Pintscher, Wikidata product manager
  • Rene Pickhardt, Ph.D. student with Steffen Staab, Koblenz
  • Claudia Müller-Birn, FU Berlin

Lydia’s talk

Links from the talk

From Q&A

Several ways of interacting with Wikidata were discussed.

The goal of Wikidata is not just to include as much data as possible, but to include it in a way that is accepted by the community. So pumping in all of the related-work.net data would not be welcome; it is better to include data that is already cited within Wikipedia, to jumpstart it.

Links to larger DFG-funded projects

Paketantrag (DFG package proposal): may fail because a single partner fails; there is no big advantage to having a package.


Markus K and Harald S: a Forschergruppe (DFG research unit) has to be well-defined, with a SINGLE strong research challenge/topic at its core; this is difficult to find around Wikidata.


1 - 2 - 4 - all

  • Individually, take 10 minutes to develop one (or two) ideas suitable for a large collaborative DFG proposal. Do this in this document, in the numbered list below, in a single paragraph.
  • After 10 minutes, discuss your proposal with your neighbor and rank your two projects.
  • After another 10 minutes, discuss it in a group of four and rank the proposals. Select one presenter to present your ideas to the whole group.
  1. Lydia - Improving and maintaining data quality in an ecosystem
    1. Based on data already in Wikidata, check whether the referenced sources actually say what they are cited for. If no reference is given, try to find one and propose it to the editor. This can, for example, work based on identifiers already present in the item (see the first sketch after this list).
    2. Analyse the data we already have and identify biases in it. Find ways to help the community become aware of them and mitigate them where appropriate. This can for example be done very effectively by visualisations like https://tools.wmflabs.org/render/toolkit/WikiMap/.
    3. Investigate how to best highlight great uses of Wikidata data in order to encourage more use and re-use and establish best practices with regards to feeding back improvements, highlighting Wikidata’s contribution and other interactions with the Wikidata community.
    4. Build more and different games to encourage new kinds of contributions to Wikidata. How can games be used to keep the data quality in Wikidata high?
    5. Investigate ways to automatically identify bad edits, for example based on machine learning. How can existing tools for Wikipedia be adapted from working on text to working on structured data?
    6. Notification service
  2. Rene - How to centralize a decentralized platform:
    1. Wikidata is currently trying to unify various data sets that are entered into different installations of the MediaWiki software (countries, different products [Wikipedia, Commons, Wikibooks, Wikisource, ...], versions [most installations took on a life of their own, with their own extensions installed]). Eventually the goal would be a central data store used by the many MediaWiki installations: first, to pull data from the knowledge base and display it in the local installation; but second, also to edit the central repository while editing a particular item in one of the installations.
    2. To achieve this ultimate goal a lot of complex social and technical issues have to be resolved.
      • How to gain and maintain the trust of the editing community?
      • How to design interfaces that have a low barrier for participation yet still transparency of who entered what data?
      • How to integrate existing work and software stacks (e.g. the templating system)?
      • How to scale the technologies to support this amount of data access?
      • What are the social and political barriers that might put such an endeavor at risk?
  3. Daniel - Wikidata and Wikibase in biodiversity research:
    1. identification keys that ask “the right” (i.e. most relevant) questions (see the question-picking sketch after this list)
    2. More games (think of the 20Q game): have (groups of) people answer a series of questions, feeding the results back (possibly with some filtering) into Wikidata
    3. linking all taxonomic concepts to the literature items in which they were defined
    4. an - ideally user-configurable - “bird’s eye view” of taxonomic relationships, e.g. for butterflies I prefer the classification according to reference A, for moths according to reference B
    5. identify in a given text statements already known in Wikidata
    6. Wikidata-based alert mechanisms (similar to Flow notifications/ watchlists, but covering mentions of external IDs too)
    7. Wikibase platforms
      1. one item per tree in Berlin
      2. one item per museum specimen
  4. Jeff T - YOUR TITLE:
    1. Represent disjointness in Wikidata. For example, something can’t be a mammal and a reptile. Disjointness makes claims falsifiable: it allows a claim to be wrong (see the disjointness sketch after this list).
    2. Wikibase instance as an extension of live Wikidata properties and subclasses/ontology. The extension can add items in a new namespace like X12345.
  5. David - MediaWiki installations for research projects are already quite common; a Wikibase installation would be the next logical step for project/data documentation:
    1. Potential use cases from my point of view, with regard to Wikibase:
      1. ontology/terminology/vocabulary building for research communities
      2. project (meta)data and data description, maybe also semantic annotation (though unsure how this could work)
    2. with regard to Wikidata:
      1. to be used for data quality purposes, e.g. cross-referencing names of places, countries, ...
  6. Anton - YOUR TITLE:
    1. From a biodiversity research perspective I am very interested in exposing biodiversity data (e.g. collections, observations, descriptions, distributions) in an LOD-compliant way. The problem I am struggling with is finding cross-domain research questions which could be tackled once enough LOD has been mobilised. We need a kind of “man on the moon” challenge we can align our activities to. Ideally we would
      1. find an interesting interdisciplinary research question which can potentially be tackled with LOD
      2. find communities willing to expose their data to support working on this question
      3. use 1 and 2 to start a DFG-funded LOD/Wikidata initiative.
    2. The funding would be used to
      • mobilise data,
      • provide tools for connecting up databases,
      • run capacity-building workshops for both data providers and researchers who would like to “exploit” the new “information space”.
    3. A core coordination project will be accompanied by a set of additional projects working on particular technical aspects or bringing in additional communities.
  7. Falko - Community ecology (synecology)
    1. enrich Wikidata with (syn-)ecological data (traits like biological interactions, but also habitat information)
    2. ecological analyses via Wikidata could be the driving scientific approach
    3. crowdsourcing by citizen scientists could be a major method for data enrichment
    4. additionally, one could understand linked data as a kind of “ecosystem”, which would be sociologically interesting
  8. Benjamin - YOUR TITLE:
    1. How can we use insights from community based ontology development projects to improve existing methods and tools?
    2. How can we define data quality and measure it?
    3. Fitness for use
  9. Magnus - YOUR TITLE:
    1. Automatic data extraction is well suited for knowledge extraction at large scale, though the resulting quality is often quite poor.
    2. How can such approaches benefit from corrections made by users, e.g. on a platform such as Wikidata? Equally, researchers could benefit from corrections made by the community on their data.
    3. In order to do so, it is necessary to bring the data to the users in an easily understandable and appealing format, i.e. aggregated, embedded in contextual information, and integrated into visualizations.
    4. The project could therefore foster the creation of user interfaces that link back to the original data in order to create feedback loops. Such feedback would not necessarily demand a correction, but could target data that is suspected to be incorrect.
    5. Questions:
      1. How can I monitor data that I put into Wikidata or that seems relevant for me? …
      2. Wikidata for scientists: Do researchers need special interfaces?
  10. Harald - YOUR TITLE:
    1. Interfacing scientific communities (databases) with Wikidata (community science)
      1. Mediation / inconsistency detection / plausibility checking
      2. collaborative ontology design / ontology mediation
  11. Christoph - YOUR TITLE:
    1. Partners: Research organisations, Data users/improvers
    2. Aims:
      1. Boost usage of open data;
      2. research the conditions for successful collaboration between research organisations and data users;
      3. identify corresponding business cases;
      4. illustrate benefits of open data use
    3. Possible collaboration partners within the Helmholtz Association
      1. Alfred-Wegener-Institute (Expedition, PANGEA)
  12. Jan - Enriching library catalogues with community/citizen science:
    1. From TIB’s perspective, we would have a huge interest in finding a way to connect Wikidata with our library catalogue. We could establish a prototype for one of our disciplines and enrich the search results in our catalogue with facts, statements and links from Wikidata.
    2. From the DataCite side there is of course the interest and option to provide DOI names for data and other scientific outputs from Wikidata.
  13. Markus -
    1. Forschergruppe (DFG research unit)
      1. Emerging ontologies (Input)
      2. Entity classifier/Wikidata Quiz (Use cases)
      3. language design, collaborative ontology authoring, reasoning-assisted ontology crowdsourcing, ontology extraction/learning and ranking, meta-modelling, ontology-to-natural-language mapping
    2. Infrastructure
      1. Semantic publishing collaborations with science journals
      2. Develop standards for deep linking of research papers
      3. Establish processes for direct data exchange with publishers
      4. Alert mechanism for new research results based on topics
  14. Gregor - Wikidata for scientists - learning openness from the citizens:
    1. Most information management systems used in professional science are based on rigid authorization control mechanisms rather than on mechanisms fostering transparency and trust. Important further topics are user-accessible versioning and a balance between the option for dissent and mechanisms that foster consensus building. Science as a whole has developed a publication system for natural-language journal publications, which implements openness and transparency, allows for dissent and, through citation forces, encourages consensus building. The project will develop a new system for a domain based on the software behind Wikidata (Wikibase).
    2. What advantages or disadvantages does a professional science information system built this way have?
    3. Cheaper, flatter learning curve; what is new?
    4. interoperability.
    5. Concrete examples at MfN: Traits and uses of organisms (Artenquiz), Fauna Europaea,
  15. Vladimir (remote) - Wikidata for Coreferencing Authority Control
    1. The report http://vladimiralexiev.github.io/CH-names/README.html#sec-1 shows that when it comes to name data sources, maybe the two that matter are VIAF and Wikidata. Their name coverage is fairly orthogonal: VIAF has more name variations and permutations, Wikidata has more translations. VIAF is much bigger: 35M persons/orgs; Wikidata has 2.7M persons and maybe 1M orgs. Only 0.5M of Wikidata persons/orgs are coreferenced to VIAF, with maybe another 0.5M coreferenced to other datasets, either VIAF constituents (e.g. GND) or non-constituents (e.g. RKDartists).

A lot can be gained by leveraging coreferencing across VIAF and Wikidata: finding errors in authority files, finding merge candidates in Wikidata, promulgating identifiers... Wikidata has great tools for crowd-sourced coreferencing. The holy grail of every GLAM worker, “Sum of All People, with links to their Works”, is coming about! But we’re just at the start of a lot of work in that direction. See https://twitter.com/hashtag/coreferencing for involving Getty, British Museum thesauri, and some fancy shots. I started Wikidata:WikiProject Authority control but I don’t know how to “run” it. I recently wrote up an ODI Culture Challenge proposal, GLAM-WIKI on Steroids (badly), but it wasn’t successful.
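A minimal sketch for Lydia’s idea 1.1 above, in Python: list the statements on an item that carry no reference at all, which is where a reference-proposing tool would start. The wbgetclaims API action is real; the choice of Q42 (Douglas Adams) as a test item and the omission of error handling are illustrative simplifications.

```python
# Sketch: find statements on a Wikidata item that have no reference.
import requests

API = "https://www.wikidata.org/w/api.php"

def unreferenced_claims(qid):
    """Return (property, statement id) pairs for statements without references."""
    params = {"action": "wbgetclaims", "entity": qid, "format": "json"}
    claims = requests.get(API, params=params).json().get("claims", {})
    return [(prop, statement["id"])
            for prop, statements in claims.items()
            for statement in statements
            if not statement.get("references")]

# Q42 (Douglas Adams) is a commonly used test item.
for prop, statement_id in unreferenced_claims("Q42"):
    print(prop, statement_id)
```

A next step would be to look at identifiers already present on the item (external-ID properties) and propose candidate references from those databases to the editor.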
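Daniel’s idea 3.1 (“identification keys that ask the right questions”) is, at its core, a greedy information-gain loop: always ask the yes/no character that best splits the remaining candidate taxa. A toy sketch, with entirely invented taxa and characters:

```python
# Toy sketch: pick the most informative yes/no character for an identification key.
from math import log2

def entropy(n_yes, n_no):
    """Entropy of the answer distribution; maximal (1 bit) for a 50/50 split."""
    total = n_yes + n_no
    return -sum((n / total) * log2(n / total) for n in (n_yes, n_no) if n)

def best_character(candidates, characters):
    """characters maps a character name to {taxon: True/False}."""
    def gain(char):
        yes = sum(characters[char][t] for t in candidates)
        return entropy(yes, len(candidates) - yes)
    return max(characters, key=gain)

taxa = {"A", "B", "C", "D"}
characters = {
    "has wings":    {"A": True, "B": True, "C": False, "D": False},
    "is nocturnal": {"A": True, "B": True, "C": True,  "D": False},
}
print(best_character(taxa, characters))  # -> "has wings" (splits 2/2)
```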
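Jeff’s disjointness point (idea 4.1) can be illustrated with a toy consistency check. The class names and disjointness axioms below are invented; in Wikidata they would be items plus some disjointness property or constraint:

```python
# Toy sketch: flag class memberships that violate a disjointness axiom.
from itertools import combinations

DISJOINT = {frozenset({"mammal", "reptile"}),
            frozenset({"person", "organization"})}

def violations(classes):
    """Return the disjoint pairs present in the given set of classes."""
    return [set(pair) for pair in map(frozenset, combinations(classes, 2))
            if pair in DISJOINT]

print(violations({"mammal", "reptile", "vertebrate"}))
# -> [{'mammal', 'reptile'}]: at least one of the two claims must be wrong,
#    which is exactly what makes the data falsifiable.
```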
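For Vladimir’s coreferencing idea (15 above), a minimal lookup sketch: resolve an item’s VIAF ID via property P214 (the actual VIAF ID property on Wikidata). Q42 is again just a test item, and the statement is assumed to carry a normal string value:

```python
# Sketch: look up the VIAF identifier (P214) coreferenced to a Wikidata item.
import requests

API = "https://www.wikidata.org/w/api.php"

def viaf_id(qid):
    params = {"action": "wbgetclaims", "entity": qid,
              "property": "P214", "format": "json"}
    claims = requests.get(API, params=params).json().get("claims", {})
    for statement in claims.get("P214", []):
        # Assumes a normal value snak (not "novalue"/"somevalue").
        return statement["mainsnak"]["datavalue"]["value"]
    return None

qid = "Q42"
vid = viaf_id(qid)
print(f"https://viaf.org/viaf/{vid}" if vid else f"{qid} is not coreferenced to VIAF")
```

Run at scale, and in the reverse direction from VIAF records to Wikidata, lookups like this are the raw material for finding the missing coreferences and merge candidates Vladimir mentions.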

Concluding discussion

  • Markus K.: “Worth knowing”: understanding Wikidata. What is most relevant in Wikidata? The Google Knowledge Graph does not show everything; different things are important for different people. Method applications, research questions. If you build a quiz, which questions do you ask? What is interesting about George Clooney? Ranking facts with respect to usage, memorability, fitness for purpose?
  • Harald: WhoKnows? [REF] quiz from DBpedia; the problem is what to ask: popularity, what is referenced most often (entity, property). Exploratory search, fact ranking; a “WhoKnows?”-style game like Who Wants to Be a Millionaire. What are the most important facts?
  • Harald: an indicator for properties whose values change often - contentiousness.
  • Magnus: In what way would Wikidata work for science: special APIs for data; how to improve Wikidata/Wikibase so that it is more useful and supports more interactions.
    • Stakeholders: science organisations from different domains, citizen scientists
    • Add a new user interface: a Wikibase extension for scientists.

Action item

Open a Phabricator item to create a “Wikimedia for research” list on lists.wikimedia.org, rather than https://groups.google.com/a/wikimedia.de/forum/#!forum/wd4r ? Or perhaps let’s stick with that for a little while?