Wikidata:Events/Data Reuse Days 2022/Outcomes/notes

On this page, we archive the notes from the Etherpad on a daily basis. Feel free to help fix the formatting of the text, thanks!

Closing

🔗Slides and useful links:

🖊️ Notes:

(What were your highlights of the event?)

  • [nikki] scribe! 
  • [rodrigo] sciencestories!
  • [lucas] Wikxhibit 
  • [nikki] live improvements is one of the reasons i'd like more casual meetups
  • [tarfah] it was great to see how people outside Wikidata/Wikimedia are using Wikidata in such cool ways! 
  • [manuel] It was so great to see how Wikidata's data is currently used for good! So many cool ideas and projects! And it made me understand better what we can do to make improvements for reusers and editors alike. Thank you everyone for your work!

Pink Pony Session

✨Number of participants:  16 (17:03), 14 (18:09)

🖊️ Notes:

  • [Alan] Idea that came out of the civic tech panel: people of similar interest related to politics/social sciences would like to come together, maybe a stakeholder group or a meta-WikiProject?
  • [Jan Ainali] Improve the discoverability of WikiProjects
  • [Magnus Salgö] Where to share the notes of a meeting about [...]? -> archive the notes on a wikipage and share it around / sub-page of an existing WikiProject
  • [Nikki] earlier I was adding missing IETF language tag statements and was wishing I could put the ISO language codes into SPARQL so I could compare them with Wikidata's data... kinda like how Mix'n'match lets people load a set of data (which ISO code? - all of them!) (see the SPARQL sketch after this list)
  • [Jan Ainali] I'd like to stop using Wikidata as a repository for "real-time" or even any time series data and instead use tabular data on Commons, only store the latest data
    • Connected to more support to tabular data, e.g. being able to query it in SPARQL
    • Why? because not all data can be stored in Items, some of them already become way too large
  • [Jim h] ORCID expander tool: lots of "researchers" items with just a bare ORCID number; with a click through to ORCID and the papers written, there is employer data. 
  • [Magnus Salgö] a new backbone that scales better -> not unrealistic, the Search team is working on it
  • [Nikki] in telegram moebeus was wishing for a whole development cycle devoted to speed and reliability, making things load faster, etc 
  • [Rodrigo] New namespace for SPARQL queries, so examples can be reused
    • it would have titles in multiple languages so people could search for the queries and find them easily
    • why a new namespace? maybe this new namespace would store specific metadata about SPARQL queries. Just as lexemes and Wikidata items have statements, these SPARQL examples would also have statements. We would need to create properties for storing information about them. Thus, people would be able to find queries of interest to them. For example, we could create the property "uses SPARQL endpoint" and, thanks to this property, we could generate a list of SPARQL queries that "uses SPARQL endpoint" "OpenStreetMap" and "Wikidata". Another property: "uses SPARQL keyword", so that people can filter queries by keywords; thus someone trying to learn how to use the "SUBSTR" keyword could find examples that use that keyword.
    • also it could have syntax highlighting 
    • https://www.wikidata.org/wiki/Template:Query_page might be useful in the meantime btw 
    • it would be nice if it were more integrated, like automatic links to the query service, instead of needing to wrap it in a template
    • we could also use badges (featured queries)
    • Rodrigo: I've shared more on this idea in https://meta.wikimedia.org/wiki/User:Rdrg109/0/3
  • [Jan Ainali] Not a pink pony, this is already an old horse, but I very much wish for [Epic] Wikidata watchlist improvements (client) https://phabricator.wikimedia.org/T90435
  • [Nikki] I think one of mahir's pink ponies would probably be https://phabricator.wikimedia.org/T199887 (client equivalent of haswbstatement) 
  • [nw520]  wish for a "Make adding references sexy" pony (MARS-pony). E.g. copy across items, auto-detect metadata like Citoid already does on Wikipedia, edit all references that share the same hash from the UI, detect 'stated in' from the URL, maybe even templates with predefined sets of properties depending on the type of reference and other QOL-improvements. 
    • Citoid has been blocked for a long time. In the meantime, there are gadgets that can make people's lives easier, such as copying and pasting statements
    • [nikki] there was a user script which broke and I recently forked it - https://www.wikidata.org/wiki/User:Nikki/CiteTool.js (it's the one that fills out title and date when you enter a reference url)
    • There's already a gadget for copying references (MediaWiki:Gadget-DuplicateReferences.js) but it requires refreshes and doesn't work across different items.
  • [Jan Ainali]  I guess the pink elephant would be Wikidata Bridge 😃 
  • [Jan Ainali] Following nikki, mahir might also go for Add termbox language code mul https://phabricator.wikimedia.org/T285156
  • [Lydia] I'd love to see a small, engaging project that reuses Wikidata and gets good press and brings more people to discover Wikidata
  • [nw520] Another QOL-pony: When pasting a Wikidata-link in a property/value field it would be great if it'd no longer be necessary to have to wait for the suggestions list to load and to select the desired entity from the list. 
  • [Tarfah] External identifiers <3 Maybe we could query other databases at the same time as Wikidata, also using APIs other than SPARQL
    • federated queries! it already exists for a few CC0 databases in SPARQL
  • [Sannita] I'd be just excited to see LD used by Wiktionary 
    • It's in progress, 3 Wiktionaries in beta test at the moment
  • [TuukkaH] Will the LDF endpoint be enabled in Wikibase Cloud? 
  • [Andrew McAllister] Started looking into autosuggest for Scribe, help with Android would be appreciated since they don't have an Android device https://github.com/scribe-org/Scribe-Android
  • [Nikki] I want a way to query for strings by language, like finding monolingual text statements that are set to german (see also https://phabricator.wikimedia.org/T167361 )
  • [Lydia] Are there parts of the data you're working on, and wish that more people would reuse it to make something cool with it?
  • [TuukkaH] I wish there would be visualisations of the ontology for documentation purposes 
  • [Rodrigo] Another pink pony: A similar tool to https://prop-explorer.toolforge.org/ but for finding WikiProjects! 
    • [Nikki] finding which wikiprojects are even active would be good 
  • [nw520] Currently working on German courts and their hierarchy. Looks nice in a graph but much less automatable than I had hoped for since Germany kinda hates publishing machine readable data. That might be a pony too, governments finally embracing open data. 😉 
  • [Rodrigo] Outrageous unlikely: Support for all plots in https://plotly.com/python/ and https://d3js.org/ as visualizations in  WDQS 
    • [Rodrigo]: I didn't find the tickets
  • [Manuel Merz] better integration of edit groups
  • [Daniel Schwabe] Can we talk about including authorship info as data in WD (provenance)?
    • Often, when using WD, there are conflicting claims about certain facts. One important information necessary to implement any kind of trust policy is the authorship information of a given statement  (i.e., "Which user was the one that added this statement?")
    • This info is complementary to the value of any "references" links, which work as support for the veracity of the claim, but are not verifiable in general.
    • Unfortunately, while available (e.g., extracted by https://www.wikidata.org/wiki/User:Ricordisamoa/WikidataTrust.js), the data itself is not part of WD.
  • [Léa] Question for you all: for future events with a similar format, are there any topics you'd like to focus on? (over the past year we had: lexicographical data, data quality and data reuse) 
    • [Nikki] more casual meetups for editing sessions, games and stuff! we never did identify any train stations 😛 
    • [Léa] If anyone is interested in running editing sessions, feel free to reach out to me, we can organize them together!
    • [Manuel] More events to create entity schemas for important classes! 😃 
  • Wikidata's 10th birthday is coming! https://www.wikidata.org/wiki/Wikidata:Tenth_Birthday
  • About food: menu challenge https://www.wikidata.org/wiki/Wikidata:Menu_Challenge
  • Politician challenge
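
A minimal sketch of the "paste an external code list into SPARQL" wish above, using a VALUES block and P305 (IETF language tag); the codes listed are just illustrative examples, and rows where ?lang stays unbound are codes still missing from Wikidata:

    SELECT ?code ?lang ?langLabel WHERE {
      VALUES ?code { "de" "pt-BR" "gsw" }   # the externally maintained code list
      OPTIONAL { ?lang wdt:P305 ?code . }   # items that already carry this IETF language tag
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }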

Qloo / TasteDive + Wikidata – Virtuous Data Cycles

✨Number of participants:  12 (16:03), 13 (16:15)

🖊️ Notes:

  • started 10 years ago in NYC. Qloo offers privacy-first APIs that predict global consumer preferences and catalogs hundreds of millions of cultural entities
  • end-to-end cultural intelligence, cultural vocabulary (structured intelligence spanning more than 575 million entities)
  • Ontology: build a structured understanding of the DNA of every entity
  • Core API routes rely on the entity data. 
  • Qloo is unique in going across domains and categories
  • search results use Wikidata's data; Wikidata plays a large role in providing the content-based results
  • Qloo has mapped anonymised taste signals
  • virtuous cycle 
  • TasteDive: consumer recommendation engine that is free to use, available to end users globally. No ads on TasteDive. Traffic is due to queries. One of the core things here is that Wikidata helps validate the entities that exist in these categories. 
  • users can add new entities as well
  • work with streaming platforms with a lot of identifiers. We ingest Wikidata's data, also the translation feature. A lot of translations are powered by the 'aka' (alias) field from Wikidata. 
  • this has been a great value proposition from Wikidata 
  • companies want to give back to Wikidata: new entity generation and merging/classification efforts, e.g. fashion classification. Wikidata is more about corporate info, but we help classify fashion as a design
  • consumer APIs have a lot of divisions when creating redundant knowledge graphs
  • very eager to return the favour to Wikidata in whatever way we can. 
  • feel free to explore the sandbox around our APIs. 

💬 Questions & answers:

  • What are the biggest challenges you encountered when retrieving and using Wikidata's data?
    • scarcity of properties [coverage], velocity of different data categories, managing different updates of different entities can be challenging
  • You mentioned giving back to the community: can you elaborate on what you'd like to do?
    • donate financially, have a lot of classification efforts that we've invested in, generate new entities into our graphs that have been validated, high-velocity domains, to help accelerate the moderation efforts of Wikidata
  • Did you ever edit Wikidata, improve/add/fix data upstream, or did your clients do so?
    • yes, members of the team actively contribute to Wikidata and the ecosystem. 
  • Do you mainly use the external identifiers on Wikidata to establish a sameAs relation or also to retrieve data from the other linked databases?
    • the external identifiers are used in multiple ways. e.g. "music" coming from many different sources, hence de-duplication is an important effort. 
  • Do you have some weird/interesting/strange/funny/... examples of recommendations people can get from your system?
    • yup, some of them are unpleasant, particularly when going across domains. 
    • Often high-brow content co-occurs with lighter fare (reality shows like "The Bachelor"); we've done some work to model out the extent to which some of these edges may be occurring due to "ironic viewership" or other latent factors. 
    • Fun to try very attenuated associations such as Books <> Dining or Fashion <> Film


Wikidata for civic tech

✨Number of participants:  26 (18:08), 29 (18:33)

🔗Slides and useful links:

🖊️ Notes:

  • Comparative Legislators Database (CLD)
    • so far covers 10 countries, 45,540 contemporary and historical politicians
    • integrated with several existing datasets (from political science)
    • Analysis-ready data: rows= politicians, columns
    • accessed through R-package
    • relational database (a graphic with the tables, the attributes of each table and the way the tables are related was shown)
      • Tables: Wikipedia History, IDs, Offices, Core, Wikipedia Traffic, etc.
      • Some attributes of those tables:
        • Wikipedia History: pageid, revid, parentid, user, userid, time, size, comment
        • Core: country, pageid, wikidataid, wikititle, etc.
        • IDs: wikidataid, parlid, gndid, libcon, etc.
        • Professions: wikidataid, occupation_1, occupation_2, occupation_3, etc.
        • Political: pageid, session, party, constituency, session_start, session_end, etc.
        • (more tables and more attributes, for exact details see the slides)
    • Motivation
      • see article on Political Advertising on the Wikipedia Marketplace of Information 
      • survey from colleagues and fellow political scientists that this data is in high demand
      • the overwhelming share of data collection is done by scholars' own efforts; this resonates with the data landscape of that time and offers a resource for scholars
      • data collection: from Wikipedia (entity identification)=> data collection=> cleaning, verification=> database integration
    • challenges
      • data quality: validate Wikidata data against existing (manual) datasets. Mismatch Finder seems promising to us
      • data availability: differential data availability between Wikipedia and Wikidata; over-representation of Western legislatures; contributing data in bulk is difficult; recency bias and substantial missing values in some data fields (more recent legislatures are better covered) 
    • Outlook CLD ver2 (summer 2022)
      • updating existing data, + 5 new legislative sessions
      • 5 new countries + European Parliament
      • correction of several errors
      • new website
        • granting data access in CSV and SQL formats
        • providing tutorials on contributing data to CLD and Wikidata
  • Govdirectory: the easiest way to contact your government
    • Problem: who do you ask and where do you ask your questions?
    • Target group: the active citizen, 'early adopter' for policy change, but not all people with that aspiration have the time to engage in traditional ways
      • The problem: "I want to engage, but cannot go to town hall meetings"
    • solution: Govdirectory. A crowdsourced and fact-checked directory of official governmental online accounts and services (built on Wikidata and co-curated with the Wikimedia community)
    • start a conversation with gov
    • make your concerns public
    • Features:
      • basic information for context
      • online accounts and services
      • search by topic/ country
      • give feedback/ suggest improvements in Wikidata
      • community curated data in Wikidata
    • DEMO
      • https://govdirectory.org
      • List of countries: https://govdirectory.org . Countries included: Denmark, East Timor, Finland, Germany, Ghana, Greenland, Netherlands, New Zealand, Norway, Philippines, Russia, South Africa, Sweden, Ukraine and United Kingdom.
      • Page of Sweden: https://govdirectory.org/sweden
        • It lists agencies that exist in Sweden. 
        • It shows social media accounts for each agency
        • Filter by name
        • Once an entity is clicked, information is shown: parent organization, official website, phone, and platform accounts (twitter, facebook, github, youtube, linkedin, etc.). It also includes a button for the Wikidata item
      • Uses classification by UN
      • https://govdirectory.org/environmental-protection
  • OpenSanctions
    • sanctions data that interests journalists and the business community
    • the topic of 'sanctions' has been attracting a lot of attention in recent weeks
    • relate to Wikidata
      • feeds into wikidata and how we benefit Wikidata
        • partnership with other initiative, Peppercat, assemble data on parliaments in the world: every-politician-scrapers. Github organization: https://github.com/orgs/every-politician-scrapers
        • data on the structure of office holders
        • put a lot of these materials into Wikidata
        • sanctions data is one part of the puzzle. The other part is Politically Exposed Persons. We access the Peppercat World Leaders datasets, import that into OpenSanctions, and feed it into Wikidata. 
        • Output is a profile document, with data from Wikidata and other sources, providing integrated profiles, including family members of these people. Uses Wikidata IDs as the authoritative IDs in OpenSanctions
        • a lot of different projects also use Wikidata, such as declarator.org; the information is pulled in from OpenSanctions and Wikidata. 
    • Russian Oligarch Database
      • List of Russian billionaires = oligarchs?
      • editing one of these profiles or those of their family members will most likely point you to them being sanctioned
    • Find more ways of feeding information on sanctions into Wikidata and OpenSanctions
  • OpenParliamentTV
    • Search engine for Parliamentary Speeches between different parliaments
    • Goal is to make parliamentary debates more transparent, accessible and understandable
    • Problem:
      • parliaments publish video recordings and sessions protocol/ proceedings but
        • in two completely separate systems
        • administered by two separate departments
        • published at two separate platforms
    • Consequences:
      • generally no connection between video recordings and spoken text
      • videos of the speech are not searchable (content wise)
    • Solution: sync them (videos and protocols) together
    • Result: interactive transcripts, shareable, linked with different documents/ additional data
    • Reuse identifiers from Wikidata to identify members of parliaments (QID of the MP)
    • Features: Indexing the synchronised speeches, videos become searchable, search engine for parliamentary speeches
    • Long term goal:
      • make political debates accessible and link them between different parliaments e.g. state, federal, EU level
  • Party Facts
    • started in 2012, same as Wikidata
    • political party data exists in many different data sources. 
    • Wikipedia and Wikidata are a major source for curation
    • parties that won more than 5% in a national election: major party criterion
    • linking 58 datasets; includes 5,700 core parties, linked to 40,000 parties in external datasets
    • Party sources
      • Wikipedia infobox: limited only to relevant parties, not all in the world (deliberate)
      • Wikidata
      • only parties already in Party Facts
    • Application: use classifications from Wikidata and Wikipedia 
    • the Wikidata Promise
      • to provide a KG on all parties in the world, for the social sciences
      • opportunity to link social science observations
      • social scientists not there yet (SPARQL, triples, APIs...)
      • tools and examples that are more accessible are needed (Python, R, ...)
      • Wikidata concepts and technical things are hard for them to access; need more accessible tools, best cases and examples of how these can be done
  • Ideograph: explore ideologies of political parties with SPARQL requests to Wikidata, using D3 and PixiJS (see the query sketch after this list)
    • two types of nodes: ideologies and political parties
    • force-directed graph: repulsion, gravity and attraction forces in graphs
    • groups parties which share ideologies, group ideologies shared by multiple parties, peripheral ideologies are more extreme or polarizing
    • URL: https://ourednik.info/ideograph/?poland,romania
    • filter by country and direct links
    • Observations
      • Euroscepticism seems to be the ideological heart of Europe
      • Environmentalism against fascism
    • P1142: "political ideology". some political ideologies are not political ideology in other texts
    • Framework: 
      • D3: force directed graph
      • PixiJS: 2D WebGL Renderer; much faster than SVG or canvas
    • Call for good practice
      • separate .rq files for the SPARQL queries, with dynamically replaceable variables
      • better editability and readability by other users
    • Needs:
      • involvement on Github by other programmers for better GUI
      • More wikidata on the political parties and ideologies
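
A minimal sketch of the kind of SPARQL request behind Ideograph, matching the Poland example in the demo URL above (QIDs are assumptions for illustration: Q7278 = political party, Q36 = Poland; not necessarily the exact query the tool issues):

    SELECT ?party ?partyLabel ?ideology ?ideologyLabel WHERE {
      ?party wdt:P31 wd:Q7278 ;     # instance of: political party
             wdt:P17 wd:Q36 ;       # country: Poland
             wdt:P1142 ?ideology .  # political ideology (P1142, mentioned in the notes)
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }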

💬 Questions & answers:

  • CLD (Comparative Legislators Database)
    • If I understood the graphic of the relational model correctly, the CLD database stores the revid of Wikipedia articles in the table "Wikipedia History". How is the information of each revision useful for CLD?
      • Sascha: We store the Wikipedia traffic on those pages. Organizations have an IP range; when someone edits something in Wikipedia, it is linked to an IP address.
    • Modelling of members of parliaments is indeed quite messy in Wikidata. Since I'm currently working on members of Landtag and Bundestag: What information do you expect to have in Wikidata and do you already have ideas on how to model it (e.g. memberships in parliamentary groups – there are currently two ways to model this)?
      • Sascha: I don't really have an answer to this.
    • Do you see yourselves as political scientists first or techies first?
      • Sascha: We are political scientists, so I guess, political scientists. It depends on what you consider techie. The drive of this work is a substantive drive, it's not a tech exercise. Most of the time, the tech part of the project is not fun, it is a service to the profession.
    • Are you already collaborating with "Member of Parliament Count"? (If not, there might be good synergies to be found.) https://mp-count.toolforge.org/
      • Sascha: I wasn't aware of this. I'll definitely check it out.
  • Govdirectory
    • How do you decide which countries are in the directory currently?
      • We add countries based on completeness of data. If we can query for a well-defined set of agencies then we can add the country and add more agency types as they get "completed" on Wikidata. From experience, it's usually easy to start with ministries or first-level administrative unit agencies. (See the query sketch after this Q&A section.)
    • Would you like to enable people to edit directly Wikidata from your interface in the future? Do you already have an idea how to do it?
      • That is a very interesting thought, but for now we focus on A) directing people to Wikidata and B) improving tools that make the life easier for Wikidata editors (for example user scripts). One bonus of having people actually editing on Wikidata rather than somewhere else, is that it becomes very transparent on where they put data and on which terms.
    • UN classification is COFOG? Then you could probably link it up to government expenditure easily - show each body by how much money they have :)
      • It is indeed COFOG (P9798); however, expenditure is often reported per COFOG category.
    • For Germany there's also FragDenStaat (Q63413894, P6744 (FragDenStaat public body ID)), a portal for making Freedom of Information requests. Do you think that it might be worthwhile to collaborate or integrate their API too?
      • We link to FOI-platforms when an agency has P10214 (URL for freedom of information requests), example: https://www.govdirectory.org/sweden/Q508140/
      • One downside with the FragDenStaat public body ID is that it only covers one country. So far we only display properties that would be the same for all countries.
    • Not a question - check out: https://publicbodies.org/ 
      • Thanks, we do!
  • OpenSanctions
    • Besides curating information of politicians and people of interest in Wikidata, is there any other way people could make OpenSanctions more useful?
  • OpenParliamentTV
    • Are you looking to expand to other countries? 
    • Is text linking used in the subtitles for concept disambiguation? For example, if someone searches for a name that is used by two politicians, but only wants to get results for one politician.
    • Do you have any plans for expanding the usage of Wikidata in the project? 
      • > Yes. More data could be added that is based on wikidata.
    • Who writes the subtitles for the videos? Is it provided in the government website along with the video? 
      • > there is live subtitling provided by the German Bundestag, but mostly the text is based on the official protocols by the stenographic service of the Bundestag
  • Party Facts
    • Do you know what are the main issues that social scientists encounter when trying to edit Wikidata? Any suggestions on what could be improved (e.g. interface, tools, data model...)
      • need tools to encourage social scientists more to improve the data on Wikidata
      • comment about getting election results into open data: that's a topic that Open Knowledge Foundation Germany is regularly working on, and some of their volunteers are quite Wikidata-savvy :)
  • Ideograph
    • Are all the political ideologies and political parties in the world represented in Ideograph?
      • I don't think so. It depends totally on Wikidata, so only insofar as it is known on Wikidata
    • How do you sort out the conflict that some political parties have broken off from another political party due to personality differences but still share the same ideologies in the same country? This is a common trend in countries with 'extremely vibrant democracies' where party hopping by politicians is not unusual.
      • manage to isolate political parties that are no longer valid, but the historical link to political parties will need to be inserted.
      • those relations between parties and ideologies need to have a time stamp. one can add qualifiers to timestamp. 
    • I am curious how the political parties in Hong Kong are aligned in Ideograph, vis-a-vis their ideology, as the ruling party is obviously heavily influenced by the Chinese Communist Party in China.    
      • I suggest taking a look at how it looks on Ideograph, and checking whether the data is up to date or not. 
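
A minimal sketch of "querying for a well-defined set of agencies", as mentioned in the Govdirectory answer above (illustrative only, with assumed IDs: Q327333 = government agency, Q34 = Sweden, P2002 = Twitter username; Govdirectory's real queries are more elaborate):

    SELECT ?agency ?agencyLabel ?twitter WHERE {
      ?agency wdt:P31/wdt:P279* wd:Q327333 ;     # instance of (a subclass of) government agency
              wdt:P17 wd:Q34 .                   # country: Sweden
      OPTIONAL { ?agency wdt:P2002 ?twitter . }  # Twitter/X username, if any
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }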

Lydia's question about challenges:

  • keep track of changes that relate to a project
  • have labels in API responses


How inventaire.io reuses and extends Wikidata bibliographic data

✨Number of participants:  16 (15:03); 17 (15:30)

🔗Slides and useful links:

🖊️ Notes:

  • since 2014/15: keep an inventory of books and share it with your friends. 
  • allows different ways to organise the inventory
  • collaborative structure of the inventory, also personalisation features such as "shelf"
  • second pillar: Wikidata-federated open bibliographic database; reused and extended the database. All CC0
  • access data from the API and SPARQL queries 
  • graph of entities used: inventaire.io entities map
  • building on top of Wikidata provides access to a lot of data and information
  • entities in Inventaire link to Wikidata entities. Federated properties and items, aggregated queries: SPARQL + local db
  • there is the option to move data into Wikidata from Inventaire
  • ad-hoc internal wikibase-like system
    • cons: a lot of duplication with MediaWiki/Wikibase
    • missing features: descriptions/ aliases/ qualifiers/ references/ ranks/ talk pages
    • pros: federated properties/ items; event hooks; update users' exemplars; pre-edit constraints; properties per type; possibility to move an entity from one instance to another; (future feature) federated layered sameAs entities
  • Caching strategy
    • local cache for items and query results (LevelDB)
    • 1 month expiration date, unless cache bust
  • types and constraints
    • our hacky but still reliable and efficient way of typing entities is still alive
  • search
    • indexing all relevant items from Wikidata dump per type
    • searching subjects (= any Wikidata item); allows searching by type (via Elasticsearch); search directly on Wikidata
  • External ids
    • using national libraries' data to create local entities with external ids
    • using those external identifiers to auto-merge and reconcile back to Wikidata (see the query sketch after this list)
  • Challenges
    • diverging interests with the Wikidata community:
      • wikimedian project might have an incentive for maximalist/ aggregated entities vs narrow/ single concept entities
        • work and edition also e.g. anime and manga example in Wikidata
        • changes in project ontologies e.g. WikiProject_Books
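
A minimal sketch of the external-identifier reconciliation described above: before merging, look up whether Wikidata already has an item carrying a given identifier (P268, the Bibliothèque nationale de France ID, is used purely as an example property, and the ID value is a placeholder):

    SELECT ?item WHERE {
      ?item wdt:P268 "EXTERNAL-ID-VALUE" .  # exactly one result suggests the local entity
                                            # can be reconciled with this Wikidata item
    }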

💬 Questions & answers:

  • What challenges or issues did you encounter when building the Wikidata editing feature? Any tips you'd share with people who are also building tools to edit Wikidata from an external website?
    • The challenge is to edit Wikidata: there was no good library to edit Wikidata from Node.js, so I wrote one called wikibase-edit, available on npm (see the sketch after this Q&A section). 
      • Best solution: go through OAuth to edit Wikidata from people's accounts
    • Since our system doesn't integrate qualifiers and references, we got in trouble when the Books WikiProject required to [...] and we didn't have a way to represent that. Now we have a function that converts from our representation to the Wikidata representation. If we were building it today, we'd integrate qualifiers and references from the start.
  • How do you handle items which are works but are incorrectly linked with Wikisource editions? For example - https://www.wikidata.org/wiki/Q183157 ? Do you have mechanisms to fix these errors, or FRBR data model errors in general, on Wikidata?
  • What is searching like? (for books, not individual values) Can users perform advanced searches by combining multiple properties/object values, excluding them, etc.? Is this done via SPARQL queries? (sophisticated searching is something I've found lacking in many similar apps) e.g. I want to find a good book to read!
    • still not implemented yet, but something we would like to do in the future.  - I'll keep an eye out. I think you could really own the space here :)
  • How do you handle manuscripts and periodical volumes, issues etc.?
    • at the moment we don't. we are definitely not there yet for manuscripts. we don't get that from users. 
    • the problem we have for mangas is that sometimes the chapters are gathered in different ways, depending on the local editors (we'd have to describe precisely what chapters each volume contains in each country, and we're not there yet, maybe it's not even worth it)
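
A rough sketch of the wikibase-edit + OAuth approach described in the first answer above, based on the library's README; option and function names are from memory and should be double-checked against the wikibase-edit documentation before use:

    // Node.js sketch: add a statement to the Wikidata sandbox item (Q4115189)
    const wbEdit = require('wikibase-edit')({
      instance: 'https://www.wikidata.org',
      credentials: {
        // OAuth credentials obtained for the user's own account
        oauth: {
          consumer_key: '...', consumer_secret: '...',
          token: '...', token_secret: '...'
        }
      },
      userAgent: 'my-reuser-tool/1.0 (https://example.org)'  // hypothetical tool name
    })

    wbEdit.claim.create({
      id: 'Q4115189',     // Wikidata sandbox item
      property: 'P2002',  // Twitter/X username, purely as an example
      value: 'example'
    })
      .then(res => console.log('created claim', res))
      .catch(console.error)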

Wikidata – The secret sauce in many JSTOR Labs projects

✨Number of participants:  16 (13:04), 22 (13:21)

🔗Slides and useful links:

🖊️ Notes:

  • JSTOR labs is part of the JSTOR non-profit, providing scholarly resources to higher education institutions
  • Linked open data, Wikidata in particular, has been quite powerful and useful to JSTOR labs in a number of projects
  • The JSTOR Labs team has existed for around 7-8 years, working on a number of prototype projects. 
  • Current focus is on two projects in particular: Access in Prison Initiative and ?
  • Interview archive
    • Makes use of Wikidata
    • Done in collaboration with the Kunhardt Film Foundation and HBO 
    • The focus is the documentary "King in the Wilderness", which uses input from US civil rights activists -- but there was a lot of content that didn't make it into the film 
    • Project partners wanted to make connections between this background footage and create interesting ways to navigate through the film 
    • Using this tool, users can browse by interviewee e.g. John Lewis, or topic e.g. Chicago, nonviolence, civil rights movement etc
    • Links to different topics on Wikidata items and Wikipedia, which pop up when the interviewees talk about the topics e.g. Montgomery Bus Boycott
    • Core idea: Wikidata is used to create these interconnections 
    • Ultimately, they were able to interconnect 40 hours of footage
    • Transcripts were provided to the project team, which were then manually tagged with Wikidata QIDs
    • Implemented content into a self-hosted Wikibase instance, federated a lot of queries from the Wikibase instance with Wikidata
  • Understanding series
    • Project about improving discoverability and connectability of content
    • You can scroll through the primary text, with JSTOR articles linked to specific text passages that are referenced in other texts. The primary text acts as a bit of a hub, with connections to other JSTOR resources on related topics. 
    • Pulled aliases and metadata from Wikidata, and then looked for connections with entities in Wikidata 
  • Plant humanities lab
    • Project performed in collab with Dumbarton Oaks research library
    • A series of visual essays on different plants in the database
    • Includes a semantic search tool that pulls data from Wikidata and other sites
    • Wikidata is also used for automatically obtaining location coordinates
    • Generating infobox type popups for mentioned entities
  • Juncture
    • Juncture is a generalized tool that anyone can use to create visual essays, like with the plant humanities lab
    • Kent maps: using Juncture with content on the area of Kent in the UK 
    • Allows users to create text narratives augmented with maps and images 
  • DEMO TIME 

💬 Questions & answers:

  • reference to the Interview Archive: King in the Wilderness project, do you use other sources or databases besides Wikidata? 
    • the videos are hosted on YouTube, and also in the JSTOR repository 
  • Any plans to add further to the Wikibase instance for this, or other projects?
    • Used it for a number of projects, also use it for plant humanities. Entities in local wikibase instance used for federated search, use it quite a bit. It provides a lot of capabilities for connecting content. 
  • can you talk a bit more about the good and not so good experiences you had relying on data from Wikidata for your projects?
    • Mostly positive. There is an initial learning curve with SPARQL; the graph is so large that if you're not careful about your queries, you run into timeout issues. 
  • what is the url of the website you showed before juncture with the sunflowers?

KGTK: A toolkit for reusing Wikidata

✨Number of participants: 17 (19:30) 21 (19:45)

🔗Slides and useful links:

🖊️ Notes:

  • Reuse: a function that takes Wikidata and files of different formats as input, transforms them into a dataset, and uses it in an application 
  • Common use-case: just reuse Wikidata, without combining other datasets
  • 2 use cases:
    • I want to create a "coauthor subgraph". The new subgraph uses a coauthor property that is not present in Wikidata, plus interesting analytics on co-authors; see also the co-actor graph, a reuse of the data to analyse actors existing in Wikidata
    • I want to create a graph about movies and books that exist in Wikidata. I want info about the people, organisations and places involved, include "subclass of" edges to root. I don't want the graph to show chemical compounds and rivers.
    • experimentation related to this in https://github.com/usc-isi-i2/kgtk/blob/master/tutorial/build-kg/build-tutorial-graph.ipynb
    • example: Arnold Schwarzenegger in KGTK: https://kgtk.isi.edu/iswc/browser/Q2685
  • KGTK is for these kinds of use cases. Creating subgraphs of Wikidata is hard (using the SPARQL endpoint)
    • hard to specify what I want in the subgraph
    • I want to get a coherent subgraph 
      • classes and super classes
      • property definitions
      • qualifiers and references 
    • computationally expensive
  • KGTK toolkit to manipulate data from Wikidata: import (from Wikidata)=> KGTK pipeline=> export (same format as miniwikidata)
  • Wikidata reuse with KGTK
  • Why use KGTK?
    • (A) KGTK on a laptop (B) SPARQL 256GB local server (C) SPARQL public
    • First names : (A) 8.28 minutes, (B) 31.05 minutes, (C) time out
    • Class instances: (A) 88.97 minutes, (B) >24 hours, (C) time out
    • Film instances: (A) 0.04 minutes, (B) 1.91 minutes, (C) time out
    • author network: (A) 66.39 minutes, (B) >24 hours, (C) time out
    • Cancer network:  (A) 2.62 minutes, (B) 40.19 minutes, (C) time out
    • ULAN identifiers: (A) 0.20 minutes, (B) 1.08 minutes, (C) error, query too large
    • DBpedia spouses: (A) 3.43 minutes, (B) n/a, (C) n/a
  • KGTK on a laptop can run these queries much faster than directly via SPARQL. That's why we built KGTK: expensive use cases to run to create analytics and reuse the data
  • KGTK data model: essentially Wikidata data
  • represent <edge-id, subject, predicate, object> in TSV
  • all KGTK commands take TSV as input and output TSV
  • Based on MillenniumDB (A Persistent, Open-Source, Graph Database) internally.
  • Query language using 'Cypher', because people like Cypher more than SPARQL (see the sketch after this list)
  • KGTK pipelines: the little red symbols show how you can change the graphs
  • I can create the coauthor subgraph in 1 hr on my laptop and create a pretty huge file. Some authors have 396 papers! 
  • Summary
    • create sophisticated subgraphs using KGTK, much faster than SPARQL
  • Reuse Challenges
    • is the data correct? - almost always
    • is the data up to date? - not always. See the Heineken example: Wikidata is not up to date vis-a-vis Wikipedia
    • is the data complete? - sometimes. Some classes are more complete than others. IMDb has 8.7m titles and 11.4m person records, Wikidata 316,000 films and 9m people total. Wikidata's data is not complete.
    • is the data skewed in some way? - sometimes, depending on which communities are more dedicated to putting data on Wikidata. Film data looks to be skewed: Germany seems to be over-represented in the film data in Wikidata, as compared to e.g. India.
  • We need to develop methods to characterize the correctness, freshness and completeness of Wikidata, so reusers will know what they are getting into and whether WD is appropriate for their use case. 
  • What I would like
    • class profiles: completeness report for every class; data skew report; data freshness report
    • property profiles (for statements, esp. when something has multiple values): completeness report; data freshness report
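
A minimal sketch of the KGTK edge format and a Kypher query, as described in the notes above (column order, QIDs and the exact CLI options are from memory and may differ from the current KGTK documentation):

    # graph.tsv (tab-separated); Q2685 = Arnold Schwarzenegger, the example from the notes
    id    node1    label    node2
    e1    Q2685    P31      Q5
    e2    Q2685    P106     Q33999

    # Kypher query (Cypher-like): which occupations does Q2685 have?
    kgtk query -i graph.tsv \
         --match '(:Q2685)-[:P106]->(occupation)' \
         --return 'occupation'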

💬 Questions & answers:

  • How big is the file for the analytics compared to the graphs in SPARQL? Does KGTK need much more space for the same analytics, or is it comparable to SPARQL?
    • the representation of KGTK is much more compact than RDF. Data is subsetted right away. The ratio of KGTK statements vis-a-vis triple statements is 10:1
  • In the table that compared the execution times of KGTK and SPARQL, it is shown that KGTK on a laptop runs faster than a SPARQL query on a local server. What makes KGTK run faster than SPARQL?
    • it is much more compact: no multilingual data, partitioned into about 30 different files, and we selectively load some of them into the database and can take advantage of the quality
    • pre-compute super class/ subclass closures; this makes a lot of difference. Pre-compute some intermediate products and build indices on the qualifiers.  
  • What would be the workflow to extract, merge with other datasets, then export back to Wikidata?
    • we do this all the time actually
    • Add new properties as needed
    • generate TSV files > JSON > SPARQL 
    • a typical workflow is like a Jupyter notebook
  • what is the updating cycle of the kgtk data? is there a direct sync?
    • no, no direct sync. we do only sometimes, depending on customers. 24hrs to import WD files into KGTK format. another 2 days is required for processing, embeddings take 2 weeks. goal is to do this every month, but not yet done. 
  • Is the Cypher implementation identical to that of Neo4j or something custom for MillenniumDB?
    • we don't support the full syntax of Cypher. 
    • MillenniumDB is a different DBMS, not used in KGTK! It uses a very similar graph model to KGTK.

Entity Alignment Checking Using the Wolfram Language

✨Number of participants:  15 (17:04)

🖊️ Notes:

  • Wolfram has a curation team that improves their knowledge base
  • Fixing the referent
    • External identifiers
    • Suitability
  • Wolfram in your browser
    • capable of natural language understanding (NLU) and entity disambiguation
  • Architecture
    • Alpha server powers Wolfram Alpha and Wolfram Language. Wolfram Alpha then powers Website and App. Wolfram Language then powers Mathematica, Cloud and Wolfram Engine
    • Note by the speaker: Presentation is done in a Wolfram Notebook which allows him to do computations in Wolfram within the slides
  • Features of Wolfram related to Wikidata
    • Entity Discovery: Search by label
  • Alignment from Wolfram entity to Wikidata
  • Use of Type Alignment
  • Wolfram Alpha Curation Process
    • Wolfram Language is used to create curation notebooks and then used by the curators
  • Items in Wolfram store the Wikidata ID with the name "Wikidata ID". 

💬 Questions & answers:

  • Is Wikidata used to curate the knowledge in Wolfram? If so, what are some issues that you have found when reusing Wikidata's data?
    • yes, we use Wikidata for various domains. One of the issues is the semantic drift of Wikidata identifiers (e.g. an identifier was not very well specified; we fixed the meaning; someone changed a movie to a book, but when we looked, we saw a movie). That is the issue. That's why the focus initially is on fixing the meaning. 
  • In slide 37, was the value for the statement "described by source" retrieved from the Wikidata API or does the Wolfram Server have a clone of Wikidata?
    • directly happening on Wikidata. no caching, immediately see the updates made. 
  • Is there documentation on how WolframAlpha knowledge base internally works? I'm interested in knowing how Wolfram make relationships between entities
    • no documentation. implementation details should change and evolve over time, without the user having to know. 
  • A question out of curiosity: Can SPARQL be used in some way to interact with the knowledge in WolframAlpha?
    • yes, partially some very experimental features to query entities using SPARQL. not sure if should continue that route. 

Mismatch Finder: empowering data re-users to give back

✨Number of participants:  22 (16:06), 24 (16:19)

🔗Slides and useful links:

🖊️ Notes:

  • its a cruel world out there... vandals are screwing up our data in subtle but evil ways.. makes us sad
  • also, the world is changing around us... 
  • reusers really want to help but don't really know how
  • Mismatch Finder can hopefully help us
    • someone has a way to automatically and at scale compare Wikidata's data against another database/website
    • they prepare a CSV file with these mismatches
    • upload it to Mismatch Finder (see the illustrative CSV sketch after this list)
    • others can review these mismatches, figure out whether the issue is in WD or in the other data source, and make edits accordingly
  • Conventional
    • DOB statements for German authors between WD and German National Library
    • Band member names between WD and MusicBrainz
  • Unconventional 
    • local infobox data from ENWiki and Wikidata
    • user-reported errors
  • How can you help?
    • enable the user script
    • use the Mismatch Finder website
    • work with us to get them into the Mismatch Finder
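
An illustrative sketch of the mismatch CSV mentioned in the workflow above (the exact required column names should be checked against the Mismatch Finder documentation before uploading; the external value and URL are made up for illustration):

    item_id,property_id,wikidata_value,external_value,external_url
    Q42,P569,1952-03-11,1952-03-12,https://example.org/records/douglas-adams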

💬 Questions & answers:

  • Is there a way that the types of mismatches found/feedback on mismatches can be analyzed to improve data sources?
    • see the Mismatch Store, import status and download statistics. More abilities to understand the feedback would be useful
  • How could the tool work with flagged errors (for example, a library catalog that displays knowledge cards with Wikidata, and a library user notices an error in the data and submits a feedback form to the library)?
    • a value on external source e.g. value from another library, the info will be accumulated 
  • Will "wrong data on Wikidata" automatically correct Wikidata, or do I have to do it manually?
    • no, Mismatch Finder right now does not edit Wikidata. A human is still needed to make the change, if the problem is with Wikidata, and to set the review status. 
  • Wouldn't it reduce ambiguity and generally be easier to say "right data on X"? 
    • why not
  • Another level of reason for why it's wrong, like we have with reason for deprecated rank (Property:P2241), which is an instance of Q27949697
  • Is there any method you recommend using for finding mismatched data? Suppose I'm from Colombia and I want to improve data related to Colombia in Wikidata, how can I find those mismatches?
    • look at a few items related to Colombia, e.g. politicians, and see what kind of external IDs or databases they have. See if those have a meaningful data dump/ SPARQL endpoint, prepare items with those IDs, compare the data and find the cases that do not match. Then prepare the file so it is ready to be uploaded into Mismatch Finder and upload it there. Having those IDs would be extremely useful. 
  • How much can we use this tool? External sources do not have "talk page" like Wikipedia. 
    • match everything they have via Mixnmatch and then query everything
  • Related to external sources: how do you see Mismatch Finder improving data in external sources? Primarily through them downloading their stats and making changes themselves, or individual editors making changes, both?
    • depends on the external source. ENWIKI can be edited directly; if it's the German National Library, we cannot edit it directly. Need to find ways to collect these mismatches and make them accessible for these people. The community can also find the issues and alert the database owners/administrators to make the edits.  
  • How is Mismatch Finder able to retrieve information from external sources? Does the tool have a web scraper for each of those sites?
    • it does not. It doesn't know about the other databases; you need to upload a csv file that contains the mismatches. Mismatch Finder does not do scraping. Some databases are structured, and many are unstructured, hence a general solution is really really hard to implement here. 
  • Question from Lydia: What data sources would our audience like us to check against? 
    • Library of Congress +1
    • OSM +1
    • library authority files
    • ORCID
    • National Library of Medicine
    • Getty vocabs (ULAN, tgn) +2
    • Geographic Names Information System
  • add identifiers (QIDs) from Wikidata into the box ("which items should be checked" part) 

Lightweight entity linking with OpenTapioca

✨Number of participants:  17 (15:15)

🔗Slides and useful links:

🖊️ Notes:

  • Entity linking: Annotate text with entities from a knowledge base
    • The same name can refer to multiple concepts, but in a paragraph, it refers to a specific concept. For example: "[ Associated Press ] writer [ Julie Pace ] contributed from [ Washington ]". In this sentence, the name Washington could refer to multiple things: capital city, state, 1st president of the USA, etc.
  • Why build a new entity linking system? (1) Lack of systems specifically built for Wikidata (2) Wish for a lightweight pipeline that can be trained easily (3) Support for the NIF format for training and evaluation
  • Demo: Spotlight by DBPedia (tool for doing entity linking)
  • Properties of OpenTapioca (a lightweight and easily configurable system)

💬 Questions & answers:

  • What inspired you to name OpenTapioca, "OpenTapioca"? 
    • quite like tapioca, and a googleable name. there is another tool also named OpenTapioca. see also OpenManioc, TopiOCQA. 
  • Have you tested Open Tapioca with biomedical entities? E. g. proteins, cells and diseases
    • curious to see if this could work. 
  • When I want to use OpenTapioca in a specific domain. What considerations should I make? E.g. concerning the configuration or other aspects?
    • narrow domains are great in general. It is easy to think of the types of entities you think about and then complement that with properties an item should have. 
  • What do you consider the hardest issue that you found during the development process?
    • something that I didn't mention is that you can, in principle, stay in sync with Wikidata. To get that done, I needed to ...
    • Second, ...
  • Are there any issues you are currently trying to solve in OpenTapioca?
    • not a project that has been actively working on the past few years, but will be happy to onboard anyone who is interested
    • will look into the update process and make it more resilient 
  • Do you have experience in fine-tuning OpenTapioca? Like to confirm or reject 
  • What sort of uses can you think of?


QAnswer: query Wikidata in natural language

✨Number of participants:  15 (as of 16:10)

🔗Slides and useful links:

🖊️ Notes:

  • QAnswer: a tool to query KG in natural language
  • the goal is to access Wikidata's data in natural language
  • digital assistants
    • also use Wikidata, especially when using KGs
  • smart systems are among the main consumers of Wikidata's data; they rely heavily on it to answer questions
  • given a natural language question, it is translated into a SPARQL query
  • images can also carry geo-coordinates  
  • Many queries that even Google cannot answer, e.g. give me all politicians born in Berlin (see the SPARQL sketch after this list)
  • even translations are possible e.g. what is dog in German/ Italian etc
  • Hints to be useful to community
    • Help generate SPARQL
    • find missing contextual information e.g. cast of LOTR, external links are rendered in the search result
    • fixing schema issues. differences in data modelling has consequences 
    • have fun
  • Where we like to go
    • Query structured and unstructured data
    • we want to follow Wikidata live! What we do now is take Wikidata's data and index it locally, so this process is always behind Wikidata
    • QAnswer for Wikibase: why not query another Wikibase, and not only Wikidata? e.g. the EU Knowledge Graph Wikibase instance
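
The "politicians born in Berlin" example above corresponds roughly to the following SPARQL (the query QAnswer actually generates may differ; Q82955 = politician, Q64 = Berlin):

    SELECT ?person ?personLabel WHERE {
      ?person wdt:P106 wd:Q82955 ;  # occupation: politician
              wdt:P19 wd:Q64 .      # place of birth: Berlin
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }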

💬 Questions & answers:

  • Can QAnswer answer yes/no questions? For example, Is Michael Jackson a singer? Is Joe Biden the president of USA?
    • we are not very good at it; generating "ASK" queries is in principle possible.
  • I did a quick search for "parks new york city" and didn't get any results ("Sorry, QAnswer is not able to find a good answer for this question"). Any idea as of why? Querying WD I get results.
    • are there entities for parks in New York City? There could be no such answer in Wikidata, not sure of the answer [there are results in WD - included link below]
  • Do I need to be familiar with WD's ontology to be able to ask questions (e.g. use specific names of properties, etc.)? 
    • we always try to match the labels of entities or properties as main anchor points to answer the questions. 
  • Was the question "what is dog in german" answered using Wikidata lexicographical data?
    • Yes, we tried once to index it, but I'm not sure if we are currently doing this. We didn't explore lexicographical data too much. Also data from Wikimedia Commons. 
  • Does QAnswer uses Wikidata lexicographical data in some way?
    • ^ Already answered in the previous question.
  • I don't see an option for changing the language in the interface. I also don't see the buttons for "SPARQL List", "Did You Mean" or "Direct Answer"
    • when there is no answer, you have to click on the button "Direct
  • Question related to the book example shared: could the query be "publications by [name]" to include all works? Example: there are 85 works by Finn Årup Nielsen in WD, but only 10 works with which he is associated come up.
    • Dennis: check ...
  • a search for "articles by Finn Årup Nielsen" gives the same 10 results in QAnswer
    • the problem is "what are publications". We try to stick tightly to what is modelled in Wikidata. This could be a problem of knowledge modelling
  • Comment: It would be impressive if QAnswer could reach a level where long and granular questions can be answered. For example,
    • Question example: In which universities have studied people that have been awarded a Nobel Prize?
    • Question example: What are the medical condition of characters that exist in animated series? This SPARQL query can answer that question: https://w.wiki/4yRJ
    • Dennis: I agree, but it is not easy to translate a question in natural language to a SPARQL query.
  • These are the results for the query about parks in New York City: [1]

Histropedia: Triumphs and challenges of reusing Wikidata

✨Number of participants:  20 (as of 16:01) 25 (as of 16:05) 27 (as of 16:18)

🔗Slides and useful links:

🖊️ Notes:

  • Background
    • Originally based on Wikipedia categories; originally thought they were clean and useful. Thought it was like a tree, but later realised it was more of a mesh. 
    • Realised quickly that we had to change our expectations of Wikipedia categories; then Wikidata came onto the scene. Lightbulb moment where they realised that Wikidata would be super useful for the Histropedia project. 
    • Wikidata changed things for a number of reasons...
      • It contained real structured data
      • Date precision on Wikidata. Before Wikidata, Histropedia only had year-only dates
      • Unlimited potential for finding, organising and filtering content: no longer limited by Wikipedia categories and infobox scraping 
      • Phenomenal list of data reuse options - all under a CC0 licence! 
      • The data was Multilingual which means future versions can be made available in other languages
    • Overall, Wikidata opened a world of opportunities for the histropedia project
  • Major challenges
    • Ever changing subclass tree
      • Histropedia makes use of the subclass tree, which makes it sensitive to changes in the subclasses when importing from Wikidata (see the query sketch after this list)
      • One bad edit creating one subclass can lead off into an entire branch e.g. director example [insert query link]
      • We can counter this by keeping an eye out for extremely large changes when doing our import
    • Varying specificity of statements
      • Some statements can be extremely specific, whereas others remain quite broad (seen a lot in instance of statements). Super broad categories such as "building" versus extremely specific categories such as "Jesuit church" or "destroyed church". These varying levels of precision in the categories can make things difficult and inconsistent
    • Multiple values with no preferred rank
      • It is very difficult to group content when there is no preferred rank
      • Example: Q181678 (Dolph Lundgren), many occupations listed, but no clear preferred occupation that allows the individual to be slotted into a category 
    • Determining living people / ongoing events
      • The absence of a death date does not mean that person is alive, however, there are some people where the death date is not known 
      • Recording unknown value for the death date or end date for people and events would help a lot
    • First event of its kind
      • There is no agreed upon way to specify that something is the first of its kind 
      • For example, no clear way to communicate that Neil Armstrong was the first person on the moon, or that Elizabeth Garrett Anderson was the first woman in Britain to qualify as a surgeon and physician 
      • Relying on the date doesn't really work, it would be better to have a property that communicates "firstness"
    • Confidence in references
      • Not currently a big problem for histropedia, but definitely come up in conversation with other wikidata reusers
      • If users edit a statement that already has a reference, then how do we make sure we don't end up using the wrong references for certain statements 
      • Signed statements would be an exciting development here
    • Inconsistent data
      • No schema built into Wikidata, community has been free to define it
      • The flipside of this freedom is having content added in different ways by different editors, resulting in inconsistent data
      • Wikidata schemas could help mitigate this issue, but tooling doesn't yet exist to help everyday users 
      • "Propose a schema" should make it easier for editors
  • Future developments for Histropedia 2.0 
    • Histropedia sheets tool https://js.histropedia.com/sheets/
      • A way of prototyping the kinds of content people would like to see in the directory of timelines 
    • Editing Wikidata from histropedia
      • We would love to be a window into potential data issues on Wikidata 
      • Future versions will allow users to log into their Wikimedia account and edit data in Wikidata via Histropedia, or simply flag a value for investigation by the Wiki community 
    • Multi-layered timelines
      • Visualise and compare any Histropedia timelines, as well as your own personal content such as family history or genealogy records
    • Charts as timelines layers or background 
      • Would love more statistical data, for example covid death rates, populations, GDPs etc to find correlations with world events 
    • Maps, connection graphs & other viewing options 
      • Switching between timelines, map, grouped list, graph and other supported viewing methods 
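
As a rough illustration of the rank and "unknown value" points above, here are some SPARQL sketches (to be run separately on the Wikidata Query Service). They use only standard Wikidata identifiers (P106 occupation, P570 date of death, Q181678 Dolph Lundgren) and are illustrative sketches, not part of Histropedia's actual pipeline.

  # Query 1: "truthy" occupations only (preferred rank if set, otherwise normal rank)
  SELECT ?occupation ?occupationLabel WHERE {
    wd:Q181678 wdt:P106 ?occupation .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
  }

  # Query 2: every occupation statement with its rank, for comparison
  SELECT ?occupation ?rank WHERE {
    wd:Q181678 p:P106 ?statement .
    ?statement ps:P106 ?occupation ;
               wikibase:rank ?rank .
  }

  # Query 3: people whose date of death (P570) is recorded as "unknown value"
  SELECT ?person WHERE {
    ?person wdt:P570 ?dateOfDeath .
    FILTER(wikibase:isSomeValue(?dateOfDeath))
  }
  LIMIT 100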

💬 Questions & answers:

  • Question: you've given lots of major challenges, are there a couple of past challenges that you could highlight where you think things are now solved for Histropedia? (--Mike Peel)
    • Wikidata dealt with our five major challenges straight away. Data consistency and quality have really improved over the years; there are active community members managing improvements in certain areas of the graph. WikiProjects have been great in this regard and super helpful to Histropedia. The SPARQL query service eliminated so many problems for Histropedia as well, as it enabled us to ask really specific questions. In all areas you can see improvements, and everything is going in the right direction, e.g. data quality
    • Related question: any advice you'd give to people/organizations that are just starting a project involving reusing Wikidata's data?
      • Explore as much as you can using the query service, really delve into the data you are reusing. Seek out members of the community editing the content. If you have a particular focus, then try to deeply understand the variation in how that part of the graph is structured. 
      • If considering a project reusing Wikidata's data, then feel free to reach out to Nav, as he's happy to bounce ideas around and help others make use of Wikidata (contact details in the presentation slides)
      • use the query service well
      • grabbing a data dump and working locally also helps
  • Question: Shouldn't the provenance info, especially the author, be in Wikidata itself, as opposed to being an external reference? This would enable formulating "trust policies" with respect to the "truth" of claims made in statements
    • Depends on the reference as they vary somewhat. For histropedia, when it comes to references, any improvements and further information would always be better. Signed statements would be helpful here. 
    • It is indeed helpful when the reference points to a WD page
  • With schemas, could you import de-facto schemas from, e.g., enwiki infoboxes that link against Wikidata? (--Mike Peel)
  • Is all the data shown in the timeline of Museo Del Prado retrieved from Wikidata?
    • Quite a lot of it is. Some Wikipedia content is involved too. 


Databox: a simple Wikidata-powered infobox for Wikipedia edit

✨Number of participants:  19 (14:00)

🔗Slides and useful links:


🖊️ Notes:

  • No (or not enough) Lua coders in the community, hence Databox was created
    • [Comment- Andy Mabbett] This is the third session in this series when I've heard of lack of a Lua coder being an issue.
  • No local configuration, a single Lua module, used on > 30 Wikipedias
  • Features
    • displays most WD properties
      • regular string properties
      • times
      • Properties connecting to other items
    • But not:
      • external identifiers, quantities and media
      • A list of properties to ignore
    • main image at the top
    • type in a banner
    • a map if geographical coordinates exist
  • How to use it?
    • start translating properties in your language. Should be a community effort to bring new content in local languages. 
    • copy the WD module and template code
    • create usage rules and experiment with it
    • customize it (Swedish community did that)
  • Crucial: having properties translated in the language you are using
  • Experiments
    • adding to the wikitext and previewing is a great way to see what's missing on your wikipedia/ Wikidata
    • even if you end up using another infobox template
  • Future of Databox
    • Merge with Template:Universal Infocard
    • Global template: https://w.wiki/Aho
      • [Comment - Andy Mabbett] Eventually this should make "copy the WD module and template code" unnecessary, as code will be transcludable, just like images
    • Configure with ShEx?
  • Demo

💬 Questions & answers:

  • [Andy Mabbett] This is very similar to Commons' Wikidata Infobox - is there any shared code, or other collaboration?
    • No, it's two different projects, no shared code as far as we know
  • Can this be used on incubator projects? (Merci!)
    • Please, please, please do this if it doesn't work! It would be so inspirational for the new language communities!
    • Is it possible to get the same query (https://w.wiki/4xPh) for incubator languages, please?
  • Do you have an overview of how many Wikipedias use Databox? Did you get feedback from some communities about the use of the template, the challenges they encountered?
    • more than 30 Wikipedias using it!
    • also projects using it as a code basis to create their own templates
    • some WPs use it only in draft/talk pages, as some content would only be displayed in English (including new content that would suddenly appear in the infobox)
    • also used as a writing aid (drafting content based on the data)
  • Any language fallback mechanism ? For example for properties newly created
    • General Mediawiki one (list of fallback languages for each language, eventually leading to English)
  • Does it use the ranks? Ignoring deprecated statements, and ignoring normal statements if preferred statements exist?
    • yes, it uses the best statements
  • I have zero knowledge on importing modules in MediaWiki and I want this template to exist in a Wikipedia language edition. Where can I ask for help? Are there any tutorials about creating and importing modules?
  • VIGNERON: how/where to translate "unknown value"/"valeur inconnue"? For instance on https://br.wikipedia.org/wiki/Gallienus
    • Lucas is looking into where the translation is stored ("probably in Wikibase.git, let me see").
  • It would be appreciated if we could have a video tutorial on Databox and how it can be used in Wikipedia. The demo was a little speedy and hard to catch up with.
  • [Andy Mabbett] I see that the template is also available on two Wikisource projects - are they using it (much)?
  • The dates are not shown in other scripts. Is there any way to do that?
  • Can it display qualifiers? and references?
    • Not by default, could be implemented as customization, based on the needs and habits of the project

Workshop part:

Wikidata and AI for Social Good edit

✨Number of participants:  9 (11:00)

🔗Slides and useful links:

🖊️ Notes:


  • Started in 2021, Data Engineering and Semantics Research Unit is the first research unit in the country that specialises in Wikimedia projects, collaborates with Wikimedia Tunisia
  • Research: reusing Wikimedia projects (Wikipedia, Wikidata, Wiktionary, Commons)
  • Reusing Wikimedia Projects: Why
    • Findable: can be found online for free
    • Accessible: using a variety of tools (UIs, APIs, SPARQL) 
    • Interoperable : aligned to many other resources
    • Reusable: freely licensed, no legal concerns
    • In Africa, 
      • Limited Resources: low funding, small infrastructure, lack of human capacities
      • Need for Digitization
      • Data Scarcity
  • Aim to create apps and tools for general applications to meet UN Sustainable Development Goals (see slide 10)
  • Tools: 
    • User Interface
    • Wikidata Query Services
    • MediaWiki API
    • REST API
    • Wikibase Integrator
    • Javascript modules
    • Lua Modules (to create infoboxes powered by Wikidata)
  • Applications
    • Named Entity Recognition and Topic Modelling
    • Topic Modelling-Based Recommender System e.g Covid-19
    • Dashboards for Covid-19 Knowledge (combine other information about Covid-19 with Wikidata, providing people with up-to-date information about Covid-19; see the SPARQL sketch after this list)
    • News Tracking
    • Structured Description of Wikipedia Categories 
    • Text-to-speech system for Wikipedia (http://sawtpedia.wiki)
    • Digitizing Cultural Heritage (http://makumbusho.wiki)
    • Measuring Semantic Similarity
    • Living Scientometric Study (see Scholia)
    • Living Systematic Review (how Wikidata is reused in distributed knowledge resources)
    • Biomedical Relation Extraction and Classification (machine-learning-driven biomedical relation classification based on the MeSH keywords of PubMed scholarly publications), see slide 31. 
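
As a hedged example of the kind of query such Covid-19 dashboards can build on (not necessarily the exact query used by the research unit), the sketch below pulls the symptoms recorded for COVID-19 from the Wikidata Query Service; P780 (symptoms and signs) and Q84263196 (COVID-19) are standard Wikidata identifiers.

  # Symptoms recorded on the COVID-19 item (Q84263196) via P780 (symptoms and signs)
  SELECT ?symptom ?symptomLabel WHERE {
    wd:Q84263196 wdt:P780 ?symptom .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
  }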


💬 Questions & answers:

  • What are some of the problems/ challenges in reusing Wikidata's data in your projects?
    • Main challenge: quality assurance. Several approaches have already been tried to address this problem.
  • How can Wikidata be improved to help you achieve your goals to bridge the gap between Global South and North?
    • decolonize structured data! (cf projects done with Whose Knowledge, https://www.wikidata.org/wiki/Wikidata:Reimagining_Wikidata_from_the_margins )
    • adapt the data model to support more information from the Global South, e.g. job positions, career progression and designation titles differ between North and South
    • need more contributors from the South, to add more input about their countries. Many areas of interest are still not covered by scholarly publications, e.g. social and artistic fields. 
  • What other sources of data (besides Wikidata) do you use in your projects?
    • using sources from all wikimedia projects to drive our applications
    • every Wikimedia project has its own uses for different applications and projects
    • Public information such as MeSH keywords (in the medical field) is used to complement Wikidata's data in projects. 
  • What is the best toolset for NER?
    • Maybe the ideas featured in Wiki Workshop are better (e.g., relation-based embeddings, relation-based pre-trained models, and so on). Relationships between items in the same statement (e.g., COVID-19 is currently a possible cause of pneumonia) can be used to check whether it is actually item A (e.g., cause: cause (Q2574811)) or another item with the same label or alias (e.g., cause: etiology (Q5850078)). This can be useful as a toolset for NER. 
  • Can you explain, how do you use Wikidata in Wikipedia ?
    • many ways of doing that... demo (querying, Wikidata QS, Lua)
    • Modèle:Catégories structurées on French Wikipedia

Wikidata & OpenStreetMap editing session edit

✨Number of participants:  22 (at 17:10 UTC)

🔗Slides and useful links:

🖊️ Notes:

💬 Questions & answers:

  • Comment from Andy Mabbett: Some interesting dicussion, ongoing, on the OSM-GB mailing list, about 1:1 relationship between QIDs and OSM objects, and the way some [UK] place items on Wikidata represent multiple concepts: https://lists.openstreetmap.org/pipermail/talk-gb/2022-March/028757.html 
    • Also happens with e.g. brands vs. companies, and organisations (schools, museums, hospitals, etc.) vs. their buildings/campuses
    • Comment in chat: Jan Ainali says: Let's aim for the long term. Those items need to be cleaned up; no need to do workarounds, that will only disincentivise fixing the data at all. We should have a Template:Conflated concept template to put on the talk page for when you find something that is wrong but you can't figure out exactly how it should be. 
  • What to do when there are multiple OSM features but only one Wikidata item? https://osm.wikidata.link/Q110597432
    • Sometimes you use a tag other than wikidata= (so you can’t use the OSM–Wikidata Matcher). For example, OSM maps each individual McDonald’s location, while Wikidata has a single item for the fast food restaurant chain. Use prefixed tags like brand:wikidata= for these cases: https://wiki.openstreetmap.org/wiki/Key:brand:wikidata . The name suggestion index (https://nsi.guide/ ) is based on this idea.
    • Sometimes you tag multiple OSM features with the same Wikidata QID. For example, OSM has micromapped each individual platform of a tram stop, but Wikidata has a single item for the stop.
  • How to help in Ukraine?

Wikidata & games edit

✨Number of participants:  21 (16:15)

🔗Slides and useful links:.

🖊️ Notes:

  • Wikitrivia = great PR
    • Wikitrivia is fun, simple, educational and engaging
    • it gives back and makes Wikidata better
    • Wikipedia, in addition to Wikidata, is used to
      • determine the popularity of events (page views of the enwiki article)
      • get images and names
  • Egunean Behin (iOS and Android): 80k different questions, some of them coming from Wikidata/Wikipedia. 60,000 players every day.
  • Wordle/ wordlegame.org
  • WorldLeh
  • guessr: uses WD and pictures from Commons
  • Wikidata Card Game generator
  • Gene of the Day (wordle-based game): quite nerdy for gene-nerds.. :) 
    • query for gene_symbol (from Wikidata); see the query sketch after this list
    • you can tailor it differently if you are nerd for different topics.. :) 
    • there are 5 people who are actually playing this game
  • DerDieDas: practice your German articles with Wikidata; can also be modified for other languages and to test prepositions instead of articles
  • Scribe: a keyboard that helps people with languages (see Scribe session); the goal for Duolingo is to translate Wikipedia.. hmm.. 
    • That was something that the founder mentioned as Duolingo is actually commercially translating things (it's how it makes money).
    • The general idea I threw out is that we could potentially use the information that people enter in potential Scribe games to update things.
  • vglist.co: website that imports information about video games from Wikidata. Games have Steam IDs too. 40,000 different games; 41,470 with Wikidata IDs (hence imported from Wikidata). Doesn't really sync back to Wikidata, but they do contribute a lot back to Wikidata. Blocks Pokémon Red/Blue because it's technically not a video game. 
  • Jean-Fred: Wikiproject video games
    • Data Reuse: wikipedia
    • enwiki: articles about video games, series, reception (generated by Wikidata) 
    • outside of the wiki world: video game databases can be fan-based (example of one that uses Wikidata to generate crosslinks), commercial or institutional (crosslinks with Wikidata, importing alternate titles and title variants from Wikidata in different languages), the media arts database (its data model is at a lower level of detail than ours, and it also takes data from Wikidata), and academic circles ("Using Wikidata as Authority for Video Games")    
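
A minimal sketch of the kind of gene-symbol query mentioned for Gene of the Day (the game's actual query may differ): it lists human genes with their HGNC gene symbols, using the standard identifiers P31, P279, P703 and P353 and the items Q7187 (gene) and Q15978631 (Homo sapiens).

  # Human genes and their HGNC gene symbols (P353)
  SELECT ?gene ?symbol WHERE {
    ?gene wdt:P31/wdt:P279* wd:Q7187 ;   # instance of (a subclass of) gene
          wdt:P703 wd:Q15978631 ;        # found in taxon: Homo sapiens
          wdt:P353 ?symbol .             # HGNC gene symbol
  }
  LIMIT 200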

Commons File Infoboxes based on Wikidata and SDC edit

✨Number of participants:  22 (19:05), 15 (20:03)

🖊️ Notes:

  •  Template:Information template used on 70M pages (85% of files in Commons)
  •  Slide: History of changes
    • Initially, all templates were written in wikitext
    • 2013: Lua was introduced
    • 2016: Usage of Wikidata reached stable stage on Commons
    • 2017: tab for adding properties for files (a.k.a SDC) 
  • Slide: General design of file infoboxes
    • Displayed in the language of the user, but can also take "lang" parameter to force specific language
    • Provide uniform look-and-feel
  • Slide: Goals of Infoboxes
    • Original (pre-wikidata) goals: preserve metadata, present metadata to the users, machine readability of metadata.
  • Current goals:
    • Manage many data inputs (wikitext, Wikidata, SDC). In the case of mixed inputs, allow comparison of data and transfer of metadata from Commons to Wikidata
    • Remain fast (minimize number of entities loaded)
  • Slide: Preserving metadata during upload
    • Parse information on the source website (e.g. Web Gallery of Art) and add it to the file at Commons
  • Slide: Comparison of Template:Artwork wikitext-only and Wikidata-only code
    • The wikitext-only code requires more than 15 lines, while the Wikidata-powered template call only uses 11 characters: "Template:Artwork".
  • Slide: Comparison of Commons and Wikidata metadata and other maintenance categories
  • Slide: Transfer of metadata from Commons to Wikidata
    • If data is missing at Wikidata, but exists in Commons, then it is added to Wikidata
  • Slide: Speed
    • Important to ensure positive user experience
    • How to keep the code fast
    • Minimize Wikicode part of the template
    • Avoid loading full entities
    • Put limits on properties that might have multiple values. The Template:Artwork template has a "depicted people" field relying on "depicts" and checking whether each item is a human; the template only checks the first 50 depicts values.
  •  Slide: Template:Information template
    •  It is used on 70M pages and changed very little in 16 years of existence.
  •  Slide: Module:License
    •  Information about the license of an image is already stored as SDC.
    •  Adding Template:License would be enough to render it.
  •  Slide: How to rewrite high-use infobox in Lua
    •  Post info about your plans and invite others to test the code.
  • Slide: Challenges
    • Maintenance categories do not update after changes to Wikidata/SDC. Workaround: need for frequent use of "touch.py"
  • Screencast
    • Create a Wikidata item for a painting that doesn't have one. Before doing this, there are two other buttons for making sure that the painting doesn't already exist: one performs an ElasticSearch search and the other a SPARQL query (see the query sketch below).
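
A hedged sketch of the kind of SPARQL existence check mentioned in the screencast (the actual gadget's query is not reproduced here): it looks for existing painting items whose English label contains a given string. P31, P170 and Q3305213 (painting) are standard identifiers; the search string "sunflowers" is just an example, and on a broad search the query may need narrowing to avoid timeouts.

  # Possible duplicates: existing painting items whose English label contains "sunflowers"
  SELECT ?item ?itemLabel ?creatorLabel WHERE {
    ?item wdt:P31 wd:Q3305213 ;              # instance of: painting
          rdfs:label ?label .
    OPTIONAL { ?item wdt:P170 ?creator . }   # creator, if recorded
    FILTER(LANG(?label) = "en" && CONTAINS(LCASE(?label), "sunflowers"))
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
  }
  LIMIT 20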


💬 Questions & answers:

  • Who's maintaining these templates? Do you feel like you have enough people around who know Lua well enough to maintain and improve them?
    • Not enough, we need more people who know Lua
    • But also ability to work with each other (not trying to rewrite everything without talking to others)
  • When you talked about the Template:Creator template, you mentioned maintenance categories. How are these maintenance categories useful?
    • Jarek: The categories are created by the template.
    • These templates could help, for example, to find an author whose place of birth is known in Commons, but where the information doesn't exist in Wikidata.
  •  I've heard about scenarios where an entity in Wikidata was deleted so the template in Commons that used that entity couldn't be correctly rendered. To avoid this scenario, Is there any way users can see if a given Wikidata item is being used by a template in Commons?
    • Jarek: This scenario is pretty rare. I don't think there's an easy way to check before deletion whether (a Wikidata item) is used or not. 
  • (Not just for Jarek, but perhaps for the entire group) Commons file wikitext usually, but not always, contains headers - e.g. ==Summary== and ==Licensing== - can we start dropping these as well, or will people then also get upset? (it would simplify the way in which tools can do batch edits)
    • Note: This question wasn't asked to the speaker because there were some problems with Etherpad during the presentation.
  • How do you prioritize your work? I'd be interested in also having the template Template:Specimen SDC-powered (it is not at all yet). I guess there are dozens more popular templates in the same situation.
    • Jarek: We kind of prioritize work by how much a template is being used.
  • How can other Commons contributors help you with this work (also non-coders)?
    • Jarek: Definitely going to the maintenance categories. There's plenty of work there. Using the current infrastructure to create new items also helps.
  • Where should issues with the templates be published? Discussion pages of those templates?
    • Discussion pages of those template and discussion pages of those modules.
  • (Question for the entire group) Do we have any tools for batch updating / simplifying Wikitext information templates, that can be used by non-coders / Commons contributors who do NOT operate a bot?
    • Jarek: People that don't operate a bot can use QuickStatements.
  • How we can have batch access to ingest/update data in wikidata? (Alexandre)
    • Jarek: Anybody can upload a big batch of statements in one go with no special bot. I might be wrong about the policies on that, but that's my impression.

Adding and Subtracting Wikidata for linguistic analysis in specialized domains edit

✨ Number of participants: 16 (18:02), 20 (18:03), 24 (18:06)

🔗 Slides and useful links:.

🖊️ Notes:

  • work by IBM Research on Wikidata.
  • Reusing Wikidata in industrial KGs.
  • most clients already have data for their domain.
  • that data, and how they process it, is how they make money. There are legal requirements around privacy and confidentiality.
  • they don't have ontologists or data scientists on hand
  • Question to address: how can they participate in the LOD effort while complying with privacy and confidentiality?
    • by adding and subtracting: carve out a section of WD that is relevant to a specialised domain
  • ITOPS: building a WD-centric, domain-agnostic pipeline for domain-specific ontologies
  • use case: IT Operations
    • started 3 years ago; problem: one of the clients has a database of charts and documents and has questions about them; they want a robust app where users can search, browse and query for relevant information
  • Relevant links
  • We do not have millions of data points, so we build a domain-specific ontology for IT: minimal user curation, maximal user control.
  • able to reproduce for other industries (not only IT)
  • carve out a part of data
  • solution: 3 stage process
    • illustrates 3 different uses of WD and WMF. each build on the prior one.
    • stage 1: user tell us what the scope is (positive and negative concepts). extract the concepts interested in
    • WD (and life itself) doesn't have formal criteria for TBox and ABox (instance or concept?)
    • therefore everything is treated the same: it's an abstraction, a concept, with an instance as a singleton. This simplifies a lot of how we process WD, really capturing all the vocabulary in IT
    • problem: ontologies are for domain experts, hence we have an overlay. It doesn't need to be connected; it's like a set of objects. Solution: a general library of objects (200 objects), a very high-level pattern.
    • it's like a sandbox, so we don't over-extend our graphs with other information and answers.
    • Stage 2: can we do a little bit better? e.g. with DBpedia. What we did is actually look at the categories: very good categories (which can be used to search) vis-à-vis concepts we are not interested in
    • Stage 3: what happens when things are not in WD? e.g. glossaries. Can we find the similarities in S2?
  • We did this for 3 or 4 domains and were able to create a rich ontology in 2 or 3 days
  • ITOPS Evaluation: would we get a better quality answer? Yes. Having all the good quality WD data helps.
  • ULKBL (Universal ... Knowledge Base ...), a Wikidata-centric interlay: how do we talk about and understand the words, verbs, etc. associated with these concepts? Federated data network: designed for linking, tracking and aggregating context; federates the linguistic entities.
  • why WD is the cornerstone of these overlays.
  • about 9,000 unchanged properties... we end up with 1,800 properties, and get a good mapping between those properties and PropBank. See the bear example in slide 16: many nice relationships between these properties and PropBank. In our case we want something that is a little bit more about reasoning, so we have taken it upon ourselves to look into these properties. Hence, a very good semantic snapshot of what Wikidata is telling you. (A property-survey query sketch follows this list.)
  • Elements of the ULKB graph. RRP upper ontology, mappings that describe semantic relations.
  • How we implement ULKB. our hyperknowledge graph (HKG)
    • HKG model: 1. N-ary edges, 2. data content, 3. Composite nodes (represent graphs within graphs)
    • see layout of HKG framework in slide 21
  • structured access to parts of KBs: able to select subsets of these KBs, attaching or associating a query to the context.
  • Pandemics Metaknowledge: analyse complex entities and how we can use these data; think about complex events. We found that the properties (in Wikidata) on these complex events are not very useful, so we analysed all these instances: what properties were used, how they were used, etc. There are also subevents that are hard to deduce; properties may also suggest subevents (see slide 26). We use WD to figure out the structure of these large events and what they are; this doesn't follow from normal ontologies. Three questions on complex events that WD can help us with: what newsworthy event is being described, what are the implicit properties, and what implicit subevents may happen?
  • Where we want to go? Some well-known lexical resources. what can we give back to WD?
  • lexical resources: 34 open wordnets currently mapped to the English Wordnet; the ILI (the collaborative, independent interlingual index)
    • PropBank, VerbNet, FrameNet and SemLink
    • Mappings are useful but expensive!
    • mappings are hard to maintain across versions, usually incomplete
    • Wordnet mappings (ILI still under development)
    • Many standalone research experiments (many papers, manual, and auto)
  • Modelling LR in Wikidata: slide 31
    • Some initial proposal of changes
    • this is partial information, more from Propbank and VerbNet can be incorporated
    • patient is the theme of ARG1 of locate and not of all ARG1
    • Provenance vs links to the original datasets (VN, Propbank), where?
  • How data can be ingested programmatically
  • Ingestion of datasets (and versions) vs links to datasets (and versions)?
  • external endpoints? github?
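
For the property survey mentioned above (around 9,000 Wikidata properties narrowed down to roughly 1,800), a possible starting point is the sketch below, which simply lists all Wikidata properties with their datatypes via the query service; the further filtering down to reasoning-relevant properties described in the talk is manual and not shown, and this is not IBM's actual tooling.

  # All Wikidata properties with their datatypes
  SELECT ?property ?propertyLabel ?datatype WHERE {
    ?property a wikibase:Property ;
              wikibase:propertyType ?datatype .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
  }
  ORDER BY ?propertyLabel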

💬 Questions & answers:

  • How easy or hard is it for your users to give you the initial description for the domain they are interested in?
  • Is there important info in Wikipedia categories that is not yet captured in Wikidata statements?
    • Yes. We have found that many of the items in the categories are connected somehow but not connected in Wikidata (about 10%). A study moving forward to link these categories would be extremely useful.
  • Do you have a process in place to fix mistakes or filter vandalism in the data you take from Wikidata? And do you have a process to report these issues back to Wikidata?
    • Shamefully no. We have a process to correct the mistakes (locally) but no process to report to Wikidata right now. all of these are still very experimental. but now we want to create a more formal framework to report back to Wikidata. Trying to create a repository to collate all the problems different teams have.

How Wikidata powers ScienceStories.io edit

✨ Number of participants: 19 (17:01), 24 (17:11)

Do stay in touch! Feel free to contact me at alan.ang@wikimedia.de

🔗 Slides and useful links:.https://commons.wikimedia.org/wiki/File:ScienceStoriesDataReuseDays2022.pdf

🖊️ Notes:

  • How science stories can power other apps too
  • began in Oct 2017. Kat: data; Kenneth: software engineer
  • to create biographies that would be easy for people to share and to weave into their own work, teaching and learning
  • 2017: started with 5 stories. chose these stories to honour a specific event at Yale
  • How we use Wikidata in ScienceStories.io
  • Create new items for scientists
  • add statements to the items with references back to the source of information
  • Write SPARQL queries to pull data from the item for the people as well as from related items to display in the stories
  • Collaboration platform
  • Inspiration for new elements
  • Added more than 3,000 statements referenced to the Biographical Dictionary of Women in Science
  • We replace P2093 'author name string' with P50 'author' plus the item for a scientist on scholarly publications (see the query sketch after this list)
  • We read newspapers, journal articles and news sections of professional organisations to find tributes written by colleagues and students
  • Tell us about a Scientist
  • nominate someone for science story creation
  • Currently centered on the lives of people who have already passed away, but living people are also possible (with consent)
  • Partnering with other projects
    • Ghent University in Belgium
    • Partnering with Software Heritage to build a digital computing museum of legacy software
  • Use case: Yale Digital Dura-Europos Archive
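
A hedged sketch of how one might find candidate publications for the P2093 → P50 replacement described above (not necessarily the workflow Science Stories actually uses); the name "Beatrice Tinsley" is taken from the nomination in the Q&A below, and exact string matching will of course miss variants such as initials.

  # Publications that still record this scientist only as an author name string (P2093)
  SELECT ?work ?workLabel WHERE {
    ?work wdt:P2093 "Beatrice Tinsley" .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
  }
  LIMIT 100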

💬 Questions & answers:

  • I'm going to nominate Yale astronomer Beatrice Tinsley, a New Zealander who was a pioneering cosmologist. https://en.wikipedia.org/wiki/Beatrice_Tinsley (Mike Dickison)
  • To increase the number of stories, does it make sense to start importing data/stories of Nobel Prize-winning scientists?
    • Yes, we do already have many Nobel Prize winners, but we need to search beyond that to identify and include scientists from diverse backgrounds
    • OT: we have all Nobel Prize winners in Wikidata, and Nobelprize.org also has "same as Wikidata" in its API, example http://api.nobelprize.org/2.1/laureate/1004 (1004 is this Wikidata property)
  • From what I could see, Science Stories seems to build stories on the fly for arbitrary entities in Wikidata, so what does nominating someone and working on their story imply? Does it mean curating their information?
  • Are all the images shown on sciencestories.io retrieved from Wikimedia Commons?
    • the vast majority are
  • Comment: It would be interesting to see the scientists that are related to a given field (e.g. computing, mathematics, physics, biology, etc.) to discover new stories.
    • awesome. working on a feature right now.
  • Does Science Stories retrieve data from other sites besides Wikidata?
    • we try to find APIs that give public information about our subjects; we support Vimeo, videos, etc.
  • Do you support award rationale https://www.wikidata.org/wiki/Property:P6208 . This property is used in all Nobel prizes see SPARQL https://w.wiki/4xqG Notepad getting the data https://github.com/salgo60/open-data-examples/blob/master/Nobel%20API_2_1-motivation.ipynb
    • we don't currently support that, but it's a great idea. we can do that
  • Is ScienceStories limited to Yale affiliates?
    • No
  • it would be great if one could always see the SPARQL query for each view
    • logic is to preserve the use of API
  • Are the additional information (i.e. YouTube videos, IIIF images, works from HathiTrust, Internet Archive) about scientists stored in the Science Stories database?
    • yes, or a link to the reference

The Wikidata Infobox on Commons edit

✨Number of participants: 26 (15:03), 30 (15:09), 34 (15:28)

🖊️ Notes:

  • What it is: small infobox to add context for category contents, without hiding category contents
  • Using Wikidata info only (no local definitions), completely multilingual
  • Using WikidataIB module
  • Can also see which images are in use on Wikidata, direct from the category
  • Auto-categorisation (e.g. names, birth/death)
  • Has only two modes: people, everything else
  • Adapting to community wishes/change requests/ feedback via the template talk page
  • Timeline: started in 20 Jan 2018 with manual deployment
    • 1,000 manual uses by 25 Feb 2018
    • Bot deployment started end April 2018 (after 10k manually)
    • simultaneously: bot task on Wikidata to copy P373-> sitelinks
    • by mid June 2018, 1 m uses; May 2019, >2 m uses
    • Now: over 3.85 m uses 
      • "Commons is the Wikimedia project using Wikidata the most ❤️" (Léa Lacroix in chat)
    • There are more infoboxes on Commons than on EN Wikipedia
  • Challenges:
    • P373 Commons category still exists and causes a lot of confusion
    • Ships on Commons have "IMO" categories with ship name subcats (same for planes)
      • solution: P7782, category for ship name
    • 'but we use other tools': taxons and dates added late (but now included)
    • suggestions on talk page that aren't technically straightforward
    • RexxS left the Wikimedia projects (he was doing the Lua coding)
    • Syncing enwiki links to Commons through Wikidata: cleared about 20k mis-links
  • How it works:
    • Works via the Wikidata sitelinks
    • needs sitelink from Wikidata
    • either in the topic item or the category item
    • Bot deployment once sitelinks exist (Pi bot comes along daily to add infoboxes)
    • You can add more links
  • Backend: WikidataIB
    • most data is fetched using WikidataIB, which formats Wikidata info
    • Has lots of documented parameters (see presentation); no control over this when setting up the infobox
  • Tracking categories (can be found in Wikidata infobox maintenance)
    • use for authority control tracking
    • can find missing data e.g. categories without images
    • lots of things to improve
  • The future
    • Rewrite in Lua
    • Many more categories (only 3.8m out of 7m)
      • Lots of notable places without a Wikidata item (but really should have one)... but notability issues
        • Was creating new items for people categories (with bot task), but bot was blocked due to notability concerns
        • lots of intersection categories, which wikidata people don't want 
    • More challenges/ requests
      • big backlog on talk page
      • constant balancing act
      • lots of matching Commons and Wikidata ontology still needed
      • lots of cleanup work needed
      • lots of bad links to Commons on various Wikipedias that could now be fixed using the sitelinks
      • lots of manually defined content in Commons categories needs to be manually moved to Wikidata, e.g. ships
      • really big Wikidata items break things
      • new WMF search emphasises media over categories
      • Still not sure whether Google etc. can handle indexing it (same issue as Wikidata.org itself: they don't expect multilingual info). Google search results expect each website to use only one language. 
      • Infobox can work on any other wiki as well (but limited uptake elsewhere so far)
    • Less loading time and a simpler module - "Maybe Module:Databox <https://www.wikidata.org/wiki/Module:Databox>  will help?" (Bodhisattwa in chat)
💬 Questions & answers:
  • small question about the P373 deletion: all the needed code work in Wikibase is now done, right? or was there something still missing? —Lucas
    • (Note that P373 is widely used in other infoboxes (500+ templates in 230+ projects), deletion will affect all of them)
    • the main argument against P373 now is that sitelinks are slower in queries; code to make sitelinks faster / easier to query might help (see the sitelink query sketch after this Q&A)
  • Why does https://www.wikidata.org/wiki/Q548408 ["anime and manga art book"] show up in infobox on https://commons.wikimedia.org/wiki/Category:Book_arts when https://www.wikidata.org/wiki/Q8305653 ["Category:Book arts"] is the Wikidata item with the Commons category in the site links? -Aap1890 (I had to leave early, but would love to work on cleaning up this set of items! Feel free to reach out to me if you want to work together https://www.wikidata.org/wiki/User:Aap1890)
    • "it looks like Q548408 is a conflated item and that the  labels are different in different languages" (Jan Ainali in chat)
  • I'm not clear about P910 https://www.wikidata.org/wiki/Property:P910 and P301 https://www.wikidata.org/wiki/Property:P301. Do they need to be used in addition to the sitelink in order to power the infobox? Or is the sitelink sufficient? +1
    • It depends on how much info is on the topic item. 
  • Is it possible to use SDC ("Structured Data on Commons") in the infobox (use Commons wikibase to create items about categories, that are not welcomed on Wikidata and use these for intersecting categories for example)?
    • yes (technically possible) and no (not a good idea, probably duplicated data)
  • Does the bot update if someone redirects Commons category but does not update P373 on Wikidata?
    • sort of
    • a lot of work remains to be done on P373 cleanup
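
A small sketch of how the infobox's data path can be followed in SPARQL, using the Category:Book arts example from the Q&A above: the Commons category page's sitelink leads to the category item, and P301 (category's main topic) leads on to the topic item. This only illustrates the sitelink mechanism; it is not the WikidataIB code itself.

  # From a Commons category page to its Wikidata category item and main topic
  SELECT ?categoryItem ?topic ?topicLabel WHERE {
    <https://commons.wikimedia.org/wiki/Category:Book_arts> schema:about ?categoryItem .
    OPTIONAL { ?categoryItem wdt:P301 ?topic . }   # category's main topic
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
  }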

TXTWerk: Natural Language Processing with Wikidata Knowledge Graphs – Examples and lessons learned edit

✨Number of participants: 21 (18:01), 24 (18:05), 21 (18:44)

🖊️ Notes:

  • Neofonie GmbH: digital agency 
  • Ontolux: research department that designs machine learning and artificial intelligence solutions for text analysis. Product developed: TXTWerk
  • What is text analysis?
    • Classification (e.g. category by topic), sentiment analysis, etc.
  • Named entity linking
    • use lots of statistical information gathered from Wikipedia. Wikidata is used the most.
    • Wikidata helps in the "Disambiguation" step
    • Linking entities in text (e.g. Wikipedia text) to Wikidata is difficult because multiple Wikidata items can have the same label. For example, there are multiple items with the English label "company".
    • They resolve this ambiguity.
      • They gather all the other entities in the text and build a list of those entities (a document vector)
      • They then check whether the gathered entities are related through Wikidata properties
  • Challenges when working with WD
    • Tries to directly link an item if its label is sufficiently unique; however, that produces false positives if something is both the proper name of a thing and a common word in natural language, e.g. "Like" is also the name of a village in Bosnia and Herzegovina (see the label query sketch after this list)
      • The Wikipedia page name is often less ambiguous than the primary Wikidata label
    • Coverage of common properties across organization is less than expected
      • even the most common properties (HQ location, inception, official website) cover only about half of the organizations
      • property coverage falls off strongly after that
    • Sometimes there can be different but theoretically equivalent ways to get the same structural data (for example first-level subdivisions of states)
      • this can surface inconsistencies in the data
        • Unclear what to do: treat one way as the ground truth? Use the intersection of results as the most correct set of data? Use the union of results as the most complete set of data?
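
A quick way to see the label ambiguity problem from the Wikidata side, as a sketch (this is not TXTWerk's disambiguation code): list every item whose English label is exactly "Like", the village example above, to see how many candidates a single surface form can have.

  # All items whose English label is exactly "Like"
  SELECT ?item ?itemDescription WHERE {
    ?item rdfs:label "Like"@en .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
  }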
💬 Questions & answers:
  • salgo60: One problem I see with geographic items is name changes. I feel we need a pattern for how to handle "give me the correct name for a given date". Question: do we have a discussion about this? 
  • Bertram mentioned that links between Wikidata items of words in the same paragraph are considered to do named entity linking, how do you handle the scenario of no direct links between those Wikidata items?
    • Tries to create a geometric graph of how related the entities are to each other. The goal is to use direct connections of candidate entities to the already identified entities; however, 2-step connections can also be used. Also, it makes sense to weight entities: e.g. United States of America is related to a lot of other entities and thus has little informational value, whereas some rural town is related to relatively few entities and thus has a high informational value
  • There is a DBpedia like dataset called YAGO I believe which has done merge research and implemented a more clean data set.: https://yago-knowledge.org/getting-started#what-is-yago
  • Have you tried fixing some of the issues you encountered already? If yes how did it go? if not what is keeping you from doing it?
    • Unclear how. E.g. Neofonie has a list of labels that are also common words (for example: "3") and are thus maybe suboptimal labels, but it is unclear what to do with that list
    • suggestion for best contact point for questions and suggestions to the Wikidata community: https://www.wikidata.org/wiki/Wikidata:Project_chat
  • how did you list the frequently used properties for organizations? (the query timed out when I tried it)
    • -> Not all types of organizations, but some specific types of them, were the target, and the number of properties used in them is simply counted up. (Hearing the answer, it seems that they didn't use the Wikidata Query Service for that; a hedged query sketch follows this Q&A.)
  • Is txtwerk based on Spacy or Hugging Face?
    • Not "based on", but uses libraries from them
    • esp. Hugging Face is interesting but very  resource intensive and thus harder to run in production
  • Patricia mentioned the approach they used for getting the list of territories within Germany: was this approach chosen because it could also be applied to any country in Wikidata, or because Germany is an atypical case?
    • could also be applied to any country in Wikidata (so we could use it for consistency checking)
  • Do you think the entity-extraction would ever be possible locally in the browser? (without a server)
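
Since the exact query that timed out isn't given, here is only a hedged sketch of how property coverage over organizations could be sampled without scanning the whole class: take a bounded sample of organization items and count which properties they use. Q43229 (organization) and wikibase:directClaim are standard Wikidata/WDQS identifiers; the sample size is arbitrary and the query may still be slow.

  # Property usage over a sample of 10,000 organization items
  SELECT ?property ?propertyLabel (COUNT(DISTINCT ?org) AS ?orgCount) WHERE {
    { SELECT ?org WHERE { ?org wdt:P31/wdt:P279* wd:Q43229 . } LIMIT 10000 }
    ?org ?claim ?value .
    ?property wikibase:directClaim ?claim .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
  }
  GROUP BY ?property ?propertyLabel
  ORDER BY DESC(?orgCount)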


Building family trees and other diagrams with Wikidata edit

✨Number of participants: 25 (17:03), 26 (18:15), 26 (17:46) 🔗Slides, useful links & notes:

  • Royal Trees
    • https://royaltrees.co.uk/
    • shows royal family trees for many famous families; it's a historical project. 99% of the data comes from Wikidata (see the SPARQL sketch after this list)
    • 4 different enquiries into Royal Trees
    • colour coding of boxes shows relationships, e.g. children of different parents in the royal family
    • allows multiple trees to run alongside each other
    • Royal Houses:
      • 750 (most from Asia and Europe)
  • Linked People.net
    • https://linkedpeople.net/
    • slides: https://commons.wikimedia.org/wiki/File:Data_Reuse_Days_2022_-_Linked_People_Project.pdf
    • Twitter: @_linked_people, @aminbits
    • started as a personal project.
    • Pipeline for movies, books and TV series (Wikidata, Wikipedia API), image extractor, face detection, image selection--- all locally stored
    • Pipeline for people and characters (Wikidata, Wikipedia API), graph builder, local storage, visualization
    • Family tree metadata: data presented in a way that can be reused by others
    • Browser extensions: you will be able to see this icon next to the name if there are relationships 
    • Problem:
      • incorrect data on Wikidata... Homer Simpson example
    • Also works for non-human families, such as endangered parrots: http://linkedpeople.net/person/Q7530532
  • Entitree
    • https://www.entitree.com
    • Connection with Geni
    • Entitree-flex, package developed for Entitree to show a compact tree as new entities are shown or hidden in the tree.
    • There's an option for showing or hiding the background in images
    • Currently, building family tree data on Wikidata is very manual. It would e.g. help if reciprocal relationships were automatically added to the related entity. 
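
As referenced in the Royal Trees notes above, a minimal SPARQL sketch of how family-tree data can be pulled from Wikidata (none of the three tools necessarily query it exactly this way): children and grandchildren of Queen Victoria (Q9439) via P40 (child).

  # Children and grandchildren of Queen Victoria (Q9439)
  SELECT DISTINCT ?descendant ?descendantLabel WHERE {
    wd:Q9439 wdt:P40/wdt:P40? ?descendant .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
  }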

💬 Questions & answers:

  • Royal Trees
    • Are you aware whether any of the TV/film productions (e.g. The Crown) use Royal Trees as a reference/research tool?
      • I very much doubt it.
    • Are you aware of specialized databases/research projects about royal families that we could import/use as sources in Wikidata?
      • not expert on this. Have not really ventured outside of Wikidata
    • Have you ever thought of generating the description directly with Wikidata instead of using Wikipedia?
    • Can "Find Links Between Two individuals" be used with any two humans (not necessarily related to the royal family) in Wikidata?
      • Yes, any two humans on the database
    • what kind of algorithm do you use to traverse the tree?
      • PHP on the server
  • Linked People
    • How do you deal with characters that have several identities ? or exist in several timelines or universes ? or have a real person and a fictional character? Can you show us an example?
      • some characters are real-world people, so they can be browsed as usual family trees (this links movies to family trees of real people). At the moment, I have no solution for timelines and universes (such as those in the Dark TV series!)
        • hehe that's exactly the one I had in mind :D
      • One character, multiple actors issue: this impacts roles that are played by more than one actor, and the table on the family tree page (bottom left) should be updated accordingly. At the moment we show only one person there, which should be improved in the future.
    • Do you offer the possibility to edit data directly from your interface? If not, is it something you would consider for the future?
      • No, in our design the data source will always remain Wikidata, but we periodically update the cached pages.
    • The browser extension is cool. Have you seen Entity Explosion (https://www.wikidata.org/wiki/Wikidata:Entity_Explosion) yet? Maybe there are synergies.
      • Yes, I've seen this and it is very interesting. The Linked People extensions inject links directly into the Wikidata, etc pages and there is no dialog.
      • I had a gadget script doing similar links to Wikidocumentaries but it has stopped working. I would be happy to learn about how to create such extensions, or perhaps debug the gadget problem.
  • Entitree
    • Where does "add missing image" go to? More generally, do you offer option for people to go back to Wikidata/edit Wikidata from the interface?
      • It points to images.entitree.com (needs just a Google login) - anyone can add images to an item, so that if the Wikidata image is not enough (but you can't upload one because of copyright, for instance) the image will be added to the list displayed in the box
    • Do you know how Entitree is reused/embedded in other tools? (such as Conzept):
      • See the code here: https://github.com/waldenn/conzept/tree/master/services/entitree (then grep for CONZEPT PATCH to see the roughly 10 lines that were changed). Note that Conzept uses a reverse proxy for Entitree, to enable HTTPS and hide the port. If you need more support, feel free to contact me via Twitter.
      • No, we're not monitoring that; we intentionally kept the ability to embed it in an iframe - caveat that the licence is GNU GPL 3, so if you use even a part of the code, all the code becomes open source 
    • I am a happy user of Entitree in Wikidocumentaries see for example https://wikidocumentaries-demo.wmflabs.org/Q937. I would be happy to replace the link roots with links to Wikidocumentaries. Would that be possible?
      • No such properties yet in Wikidata. 
      • https://wikidocumentaries-demo.wmflabs.org/Q9682?language=en
      • Related question: I can see that the URL to trees contain the label of the property and the Wikidata item used to build the tree, are there any plans to make URLs also consider QIDs and PIDs? This could make sharing URL to trees easier in some cases.
      • Yes QID works as well as the Wikipedia slug of the article, as a matter of fact if the article doesn't exist the URL falls back on the QID, PID unfortunately not supported, but can be easily converted

Wikidata for data journalism (with R) edit

✨Number of participants: 25 (16:05), 24 (16:33), 26 (16:35). 26 (16:44)

🔗Slides and useful links:.


🖊️ Notes:

  • We coordinate the European Data Journalism Network
  • goal is to create tools for data journalists
  • there are other Wikidata packages for R: 
    • WikidataQueryServiceR
    • WikidataR: 
  • R has grown in prominence in data journalism thanks to the tidyverse. 
  • What's the matter? (Current problem)
    • R users probably hate SPARQL or simply don't know about it
    • Wikidata is not compatible with tabular logic / tidyverse out of the box
  • tidywikidatar: new package (less than a year old) 
    • everything in tabular format
    • always the same shape/logic (predictable)
    • one row, one piece of information
    • easy local caching
    • get image credits from Wikimedia Commons
    • include Wikipedia in the process
  • Examples of what can be done with tidywikidatar
  • classic wikidata search 
  • Features of tidywikidatar
    • tidywikidatar can obtain qualifiers
    • tidywikidatar can search Wikidata through Wikipedia. Some people are more familiar with Wikipedia than Wikidata.
    • Get metadata of images at Wikimedia Commons
  • General issues:
    • slow approach if processing many thousands of items (before caching); no obvious long term solution (maybe sharing the cache)
    • no easy way to "give back" to Wikidata
      • possible integration with Quickstatements
  • Web interface and mapping issues
    • hoping to have it out by summer
    • make it more generic
    • consider data issues, e.g. streets having their own item on Wikidata
    • consider 'give back' new data to Wikidata
    • what to do with direct outputs e.g. lists of people with streets dedicated to them but not on Wikidata


💬 Questions & answers:

  • I really like the tw_query() approach for simple queries. Is it possible / do you have an established workflow on how to add optional fields? For instance, show country (P17) if it exists. (See the OPTIONAL query sketch at the end of this Q&A.)
  • Is it possible to use the tool for Wikibase instances other than Wikidata?
    • hmm... haven't looked into that yet. 
  • Are there equivalents to this in other languages that you know of? (e.g. Python)
    • Haven't seen it yet actually. I'm not much of a Python user. 
    • Would not be difficult to implement based on tidywikidatar (the biggest problem was around caching)
    • If anyone knows, please contact Giorgio! (e.g. to ensure interoperability)
  • For understanding how to query the data of Wikidata from R, is there a specific documentation at Wikidata that you mostly consult?
  • In general, have you noticed EU journalists are helping curate and improve Wikidata/Wikipedia as they use it as raw material for data analysis? I've heard the BBC has done this in the past (and perhaps still does?), but I've yet to encounter American news orgs who do this kind of thing...
    • journalists contributing to Wikidata: OpenSanctions.. etc. 
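
Since the tw_query() question above stayed open, here is a hedged SPARQL-side sketch of the "optional field" idea (tidywikidatar's own syntax may differ): streets (Q79007) named after women, showing the country (P17) only when it exists. All identifiers used (P31, P138, P21, P17, Q79007, Q6581072) are standard Wikidata ones.

  # Streets named after women, with country shown when available (OPTIONAL)
  SELECT ?street ?streetLabel ?countryLabel WHERE {
    ?street wdt:P31 wd:Q79007 ;         # instance of: street
            wdt:P138 ?namedAfter .      # named after
    ?namedAfter wdt:P21 wd:Q6581072 .   # sex or gender: female
    OPTIONAL { ?street wdt:P17 ?country . }
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
  }
  LIMIT 100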

Wikidata use in KDE’s travel apps edit

✨Number of participants: 20 (15:05), 22 (15:39)

🔗Slides and useful links:.

🖊️ Notes:

  • KTrip: public transport journey planner
  • KDE Itinerary - digital travel assistant: imports travel documents, recognises your travel plans and puts them into a timeline, and informs you of delays, forecasts, updates, maps, real-time status of amenities, train coaches etc.
  • No online data access
    • logos are loaded on demand and then locally cached
    • intention: privacy, benefits for client performance
  • Data is pre-processed and shipped with apps
    • side effect: sanity checks/ validation
    • data is bundled with the application (1 MB)
  • Data from Wikidata: 
    • 1) Airport Identification problem
      • identifying the airports is a major issue. IATA airport code (P238) 
      • Airport codes change over time and can be re-allocated (especially problematic for airports no longer in operation, or planned airports that will only be operational in the future) 
      • exceptions to uniqueness (not as unique as one hopes for): some airports have multiple codes assigned, or the same code assigned to several of them (see the query sketch after this list) 
      • import errors in North America and Argentina 
      • also used for train stations
    • Name
      • Cities with same name can cause complex disambiguation logic
    • main info we need from airport: Location (P625)
      • where we want to go vs where the coordinates (from Wikidata) are shown: different location
      • no information on Wikidata or OpenStreetMap to deal with this (concept "location of public entry" is missing from Wikidata)
      • hence both datasets are used to minimise errors
    • 2) Civil vs Military Airports
      • we are only interested in airports with passenger services
          • in some countries, military airports have IATA codes, and these are problematic for us
          • so far no nice way (yet) to query/ restrict this separation
    • 3) Train stations: largest set of data we use from Wikidata
      • more complicated than airports because there is no international standardization 
      • Numerous identifier systems
        • often operator or country specific
        • Specialization of P296
      • we don't do name based identification for train stations as too ambiguous 
      • Challenges: 
        • use of IATA airport codes for train stations
          • replacing domestic flights with high-speed trains (in France), they assign airport codes to train stations; when train stations get the same codes as the airport, it becomes tricky.
        • modelling multi-part/ virtual stations (e.g Frankfurt International Airport's train stations) 
          • technically separate stations considered as one by routing/ ticketing
          • virtual stations have their own identifiers
          • someone needs to decide how these scenarios should be modelled
    • 4) Country information
      • Country codes ISO 3166-1 mappings moved to iso-codes data set; UIC country code
      • Used for power plug compatibility warnings, e.g. from Germany to the UK (need a converter/adaptor). Data is obtained from Wikidata.
      • Driving side
      • The usual fun with countries... 
    • 5) Time zones
      • very important. users might miss train/ flights if we get this wrong
      • P421
        • IANA timezones ids
        • UTC offsets for normal time/ DST
      • UTC offsets lack information about DST transitions
      • Using a coordinate-based index rather than Wikidata now
      • two-week periods twice a year where it's impossible to schedule (daylight savings) 
      • solution: look at IANA timezone based on geographic coordinates
    • 6) Public transport lines
      • Line logos and colours
      • transport mode or product logo
      • usually that's what's used in local signage, so it's important for users to quickly match and find their lines
      • Challenge
        • identification. names are highly ambiguous
        • use OSM to obtain a bounding box for each line (which works: Wikidata + OpenStreetMap data); works in most places, except in some places we need "mode of transport" as a third criterion
        • use name + coordinate to identify a line
        • relies on Wikidata <-> OSM mapping (P402). added/ extended for a number of cities, in both directions
        • the concept of transport modes is extremely ill-defined. Not a Wikidata problem; it's a transport problem in general. 
        • Complex modelling
          • distinction between modes of transport is very blurry
          • e.g. amusement lines in theme parks or historical lines
          • level of detail varies
        • Correct attribution for CC BY-licensed image assets (images are from Commons); the licenses may be different and will need to be checked.  
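
A sketch of how the "exceptions to uniqueness" mentioned above can be surfaced directly in SPARQL (this is not KDE Itinerary's actual pre-processing, which works on downloaded data): list IATA codes (P238) that are attached to more than one Wikidata item.

  # IATA airport codes (P238) assigned to more than one item
  SELECT ?iataCode (COUNT(DISTINCT ?item) AS ?items) WHERE {
    ?item wdt:P238 ?iataCode .
  }
  GROUP BY ?iataCode
  HAVING(COUNT(DISTINCT ?item) > 1)
  ORDER BY DESC(?items)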

💬 Questions & answers:

  • How often is the data in the application retrieved from Wikidata updated? Can the user change this period?
    • updates are shipped together with software updates, approx once a month
  • Are you involved in community discussions on Wikidata, e.g. to improve the ontologies or data quality?
    • propose new properties that involves some discussions
    • I would love to be more involved on some topics but have to find the right channel. 
    • topic: transport
    • I think this will be a common problem. We should think of ways to point users to the right places and actively involve them in the discussions relevant to them somehow. If you have ideas everyone, please let us know!  😃 
  • How is the data in OpenStreetMap retrieved? through a SPARQL query?
    • nope. no SPARQL. 
  • You showed in the last slide that some icons are not credited because there's no space. To me, it looks like those icons could be generated by applying a background color. Why doesn't the application generate those icons?
  • do you have info on seat numbers and carriages? 
    • not from Wikidata at this point. In theory we may get identifiers on train coaches (down to seats); some operators publish this information, but this area needs to be improved. 
  • Interesting. Could you give a screen demo of you using the app? 
    • Probably yes, if I can figure out how to do this via screen sharing.. *trying to do it*
    • Demo: Input is DB train ticket PDF. Ticket QR code, information on trips, groups trips together, how to get from home to Berlin Hbf (uses logo from public transport operator), map visualization from OpenStreetMap, live status of escalators/ elevators
  • can I use this app when I am in other countries for e.g. India or Japan?
    • in theory yes, but in practice, better in Europe or US. We don't have public transport live data in Japan or India
    • this mainly needs input from locals. Data exists for these countries, but we will need to find a way to optimise it. 
  • Have you participated in some Wikiprojects related to transport?


Conzept: An attempt to build an encyclopedia for the 21st century edit

✨Number of participants: 13, 18 (13:42), 15 (13:58) 🔗Slides and useful links:.

🖊️ Notes:

  • interface for WP and WD. They felt too textual, so he wanted to write something for them. 
  • Idea: get WP data and WD data from APIs 
  • not just articles from WP, but multifaceted worlds of the topics, in multiple dimensions
  • System to manage all those links in one view
  • aim at having fun for researching things for education purposes
  • makes art direction of composable content views
  • images from Commons
  • about 40 apps were built and integrated on top of it 
    • Complete list of apps at: https://conze.pt/guide/apps
    • Compare concepts. Example shown: compare bird species
    • Search items by properties in Wikidata. For example, if we wanted to know more about mobile phone industries in India, we would use the properties "country" -> "India", "industry" -> "mobile phone industry"
    • web-native fused
  • The user can change the language of the interface and the information that is shown if desired
  • Conzept is GPL3 licensed. Jama: I hope people contribute to this project.

💬 Questions & answers:

  • What inspires you to create Conzept? Is it only because of the User Experience? 
    • from a youth's experience. A Dutch youth experience.
    • makes learning of these topics more fun and engaging and scalable 
    • Wikidata is great, but not exciting. 
  • Do you also use other sources of data besides Wikidata/ Wikipedia? 
  • What was your experience when retrieving data from Wikidata? Did you encounter any issue (technical, structure of the data, data quality, etc.)?
    • see notes: "Wikidata issues encountered" on Conze.pt
      • Difficulty avoiding duplicate items in SPARQL results (there are ways, but it is not easy to generalize to all SPARQL queries)
        • This should already be possible using "SELECT DISTINCT" (see the sketch at the end of this Q&A section)
      • Wikidata: claim-statements integration not done yet, because data access is more complex when using wikibase-sdk with claims.
      • Integrating Wikidata localization data for properties.
      • Abandoned Wikidata-based applications.
      • Wikipedia: Image-loading blocking sometimes (“too many image requests”)
  • Do you see a possibility for introducing data science/ machine learning into conze.pt? 
    • It's a possibility. We experimented before and it takes a lot of resources
    • Need to see what the use case is
  • The idea about having discussion options around topics is very interesting, how do you imagine it could work?
    • with audio and textual chat in the topic. chat and share ideas 
    • that reminds me of the chat rooms feature started on Twitter not so long ago 
  • You mentioned that you want to encourage gifted children to learn in a different way. How can we as a Community help you to make Conzept achieve that goal? 
    • trying in a different way. Children nowadays like to learn via videos and images, multimedia, social... 
  • How much can you customize the interface (e.g. display people's favorite topics, make new suggestions of topics)?
    • all the sections are dynamically generated, localised 
    • text to speech 
  • Are there any links pointing to the Wikidata items so that they can be enhanced if needed? Or is the tool's purpose to bring data together for display purposes?
    • latter is more the focus. basically a read-only system to consume info from WP and WD
  • Is the tool not a good candidate for bringing together info on researchers? I did a search for a researcher that is in WD but it is not displaying the correct data (https://conze.pt/explore/daniel%20mietchen?l=en&t=wikipedia&s=true#)
    • So, it wouldn't generate a profile like Scholia does?
  • Did you think of Scholia as an app that could be embedded, to provide more information about researchers/academic papers? https://scholia.toolforge.org/
    • yes. though language support is an issue for now 
    • Scholia is already embedded/linked on some pages
  • How do you deal with the various licenses of the content you gather (CC0, maybe CC-BY sometimes)?
    • sometimes there are license information. other times no such information. Will look into it. 
  • What does "Difficulty avoiding duplicate items in SPARQL results" mean exactly? That the same Q-id shows up multiple times or that one concept is covered by multiple Q-ids?
    • multiple QIDs show up due to complicated SPARQL queries
    • -> option in WDQS to make results (Q-ids) unique would help a lot!
  • Do you imagine having Conzept in augmented reality or virtual reality one day?
    • tried it but the experience is not so engaging at the moment
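
A minimal sketch (an assumption, not Conzept's actual code) of the SELECT DISTINCT approach mentioned above for deduplicating Q-ids in WDQS results; the query and the User-Agent string are placeholders:

    # Minimal sketch (assumption): deduplicating Q-ids in WDQS results with SELECT DISTINCT.
    import requests

    QUERY = """
    SELECT DISTINCT ?item ?itemLabel WHERE {
      ?item wdt:P31 wd:Q5 ;        # instance of: human
            wdt:P106 wd:Q36180 .   # occupation: writer
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 20
    """

    response = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "ConzeptExample/0.1 (hypothetical)"},
    )
    for row in response.json()["results"]["bindings"]:
        print(row["item"]["value"], row["itemLabel"]["value"])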


GeneWiki: The Wikidata Integrator & Biohackathon: report on reviewing Wikidata subsetting methods edit

✨Number of participants: 21 (20:00), 23 (20:18), 16 (20:46), 14 (21:09)

🔗Slides and useful links:

🖊️ Notes:

  • Part 2. Biohackathon: report on reviewing Wikidata subsetting methods
    • They align those entities with Wikidata. They ran into some issues.
    • When trying to get the DOIs of all items that have such a property, WDQS showed: "Server error: Unexpected end of JSON input"
    • They have been meeting in Hackathons: Biohackathon & SWAT4HCLS. Address the problem: How do we get subsets of Wikidata to do more complex stuff? Here's where subsetting comes in
    • Why subsetting?
      • Current problem: timed-out queries because of the size of the whole data of Wikidata
      • Reducing the overall costs: data can then be queried by computers that don't have much computational power.
      • Provide reproducible experiments. Wikidata data changes continuously, so a query returns different results when executed at different times; with subsetting we have some kind of snapshot of the data and can then perform analysis on that snapshot.
    • They aim to have a subset creator. It is necessary to define the items that need to be included in the subset. For example, Scholia would need papers, authors, countries. It is sometimes difficult to define the boundaries of that subset.
    • Problem statement
    • WDSub. WShEx supports qualifiers and references
    • WDumper
    • SparkWDSub: Process whole graph using Pregel algorithm. It works but requires large computational resources. Nice thing: Wikidata whole graph could be processed.
  • Seyed

💬 Questions & answers:

  • Thad: Is it worth staying in graph land when dealing with subsets? Perhaps we should use an RDBMS when we need subsetting?
  • Thad: PostgreSQL has had support for JSON for a while. https://www.postgresql.org/docs/current/functions-json.html And it's known that relational databases have strong indexing support and partitioning via file layers to allow lower memory usage. Has anyone done an evaluation of solving "subsetting" performance with PostgreSQL? (I noticed KGTK uses SQLite)
  • Thad: The other idea I had was that from Presto (particularly Aria https://prestodb.io/blog/2019/12/23/improve-presto-planner#aria-scan-filter-pushdown) which allows pushdown support now with ORC files. Has Presto been looked at as a filter mechanism perhaps against PostgreSQL?
  • what is being done to assist the community in exploring subsetting? For instance, https://wdumps.toolforge.org/dumps has been relatively straightforward to use by general Wikidata users, but it has not been producing dumps for weeks. Other tools require deeper technical insight and/or infrastructure to use, which many Wikidata contributors do not have.
  • Is there a Wikidata page to discuss ways to create subsets? It is something I am interested in. — Hogü-456
  • Have you identified Weaviate (vector search technology) as a valid alternative to your SubSelect project?


Part 1. The Wikidata Integrator

  • Andra showed how to create a new Item for a new scientific paper with Wikidata Integrator in a Jupyter Notebook.
  • Pipeline of ....: Graphic available at ...
  • Genewiki adds data to Wikidata with ^ pipeline. The pipeline checks various things and uses among others Entity Schemas to make sure the data that is going to be imported is well-modelled
  • They consider FAIR: Findable, accessible, interoperable, reusable. Graphic available at https://commons.wikimedia.org/wiki/File:FAIR_data_principles.jpg
  • Genewiki started with getting gene data into wikipedia infoboxes - issues with working across 300 languages and in text - then Wikidata entered the picture and things got a lot easier as it is structured data with qualifiers and references
  • Community engagement and model discussion: Sometimes Zoom calls. The result was a graphic that shows the relationships between entities
  • conversion of modeling decisions into entity schemas to have a machine-readable description of the data model
  • Example shown during the presentation: EntitySchema of a human gene. Wikidata items were then validated with that schema.
  • first model 10 or so items by hand and then scale it up by bot if there is agreement on the modeling
  • bots are run in Jenkins regularly to check Wikidata's data against external sources and keep them in sync. there is a manual review step involved to make sure the right changes are made in the right places
  • Wikidata Integrator is a community project, pull requests are welcome.
  • During the presentation, Andra showed how he developed a bot, OBO Bot, in a Jupyter Notebook on PAWS

💬 Questions & answers:

  • Have you noticed benefits of using Google Colab over PAWS for running code in Jupyter Notebooks?
    • When you work in PAWS, you can't do live collaboration with others as you can with Google Colab.
    • And you have to be careful not to share your password by accident (I use "import getpass" to avoid that — see the sketch after this list)
  • Do you have a documentation of the community process for deciding about the ShEx structure somewhere? It could be used as a good practice example for other subject areas.
  • You mentioned the problem of CC BY-SA statements and Wikidata. Can you leave an example that we could vet?
    • open question: ontology creation = creative process ?
  • How do you monitor the action of your bots to check that they are working as expected?
  • Is there any repository online where we can find code of the bots that you have written?
    • Andra: Yes, I'll share it afterwards
  • Can we have a link to your Google Slides presentation?
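
A minimal sketch (an assumption, not Andra's actual notebook) of the "import getpass" trick mentioned above, combined with Wikidata Integrator's login helper, so the password never ends up saved in a shared PAWS notebook:

    # Minimal sketch (assumption): prompt for credentials instead of hard-coding them.
    import getpass
    from wikidataintegrator import wdi_login

    username = input("Wikidata username: ")
    password = getpass.getpass("Wikidata password: ")  # not echoed, not saved in the notebook output

    # Log in once and reuse the login object for subsequent writes.
    login = wdi_login.WDLogin(user=username, pwd=password)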

Scholia and use of scholarly data edit

✨Number of participants: 29 (19:05) 31 (19:20)

🔗Slides and useful links:

Link list:

🖊️ Notes:

  • Tool to gather scholarly knowledge: How can we reuse the data to profile people, topics and visualize them?
  • Open data, data that is machine friendly, open source and collaborative
  • Reusing the infrastructure that WD provides. Hence same limitations as well.
    • For example, "Rate limit exceeded"
  • Tagging people with the papers and institutions they are working with, plus geolocation
  • constructing a timeline is possible too
  • academic tree, showing relationships between academics/ scholars
  • citations stats, citing authors, reusing images from Commons
  • works for a number of scholarly research subjects
  • annotations get richer and richer over time
  • Orcid IDs are reused, create author graph, names of awards and award recipients, citation graph (reusing all the necessary data to pack into one visualization)
  • combined visualization: combining from multiple profiles
  • Mention of Ordia, similar project focused on lexicographical data on Wikidata https://ordia.toolforge.org/


💬 Questions & answers:

  • 1. How has the closure of Microsoft Academic changed things for Scholia or altered any future plans? (have you seen increased usage of Scholia for instance)
    • OpenAlex has started as replacement, and Wikidata is working with that team on collaborations
  • 2. In particular, Microsoft Academic used AI for relevancy, but I understand that Scholia uses the Wembedder knowledge graph and SPARQL for co-occurring topics?
    • Wembedder not used in any other visualizations
  • 3. Can you speak about how the team feels about incorporation of AI in Scholia in places in its future ... or more importantly blockers to progress in surfacing hidden "relationships"?
    • Certainly open to that. AI has problems as well. Comes with lots of biases. AI also never 100% precise. Welcome experimentation in this space.
  • Not a question, but feedback: Some low cost computers might have problems with the page of an author because of the co-author graph, since showing #defaultView:Graph requires more computing power. It would be great if the users had the possibility of toggling off some sections that are shown on each page (i.e. /author/Q123, /organization/Q123, etc.)
  • Is the large number of scientific articles in Wikidata a challenge for Scholia, and do you have plans to base Scholia on a dump to reduce the requests to the Wikidata Query Service? From my point of view, some of the shown graphics could be generated in advance rather than on request.
    • caching has been on our radar in this grant: https://riojournal.com/article/35820/ Several options have already been explored during the project, including caching like in this suggestion. So far, we love the curation aspect, and then seeing your edits confirmed in a few minutes max, is very useful.
    • Also a nice bridge to the subsetting session later tonight :)
  • Can Scholia be used to navigate: 1. another Wikibase, 2. the Wikidata dumps now being hosted on Virtuoso or Stardog? (context: https://twitter.com/kidehen/status/1498711763159863297 ) What Blazegraph features are used? Isn't Blazegraph SPARQL 1.1?
    • I have run Scholia on top of Virtuoso. SPARQL queries are now tweaked towards Blazegraph for performance reasons. But if using standard SPARQL, then yes, no problem at all
    • For instance, we use the Blazegraph WITH keyword a lot.
  • Newbie Question, do papers have to be in Wikidata? How can they be mass imported?
    • Yes and yes :) There are bots, but adding a single paper by DOI, try: https://scholia.toolforge.org/doi/10.1038/s41385-021-00482-8 (which will redirect if the DOI exists, if not, will show QuickStatements). Just put in your own DOI. Large data set bots exist, but sort of on hold. Generally, focus on articles that support statements (== references)
    • The next talk will start with a demo how to put a paper in Wikidata :)
    • You can also copy-and-paste the DOI into the search field and the DOI will be extracted and the metadata will be requested from CrossRef. (Nice! I didn't even know that)
  • I just found out about I4OC. How is that relevant?
    • also check out OpenCitations
    • Hmm, OK, is your feeling that I4OC still needs to exist then?
      • yes, there are still some ~40% of citations in Crossref that are not openly licensed, and then there are many more that Crossref does not know about
  • What's the advantage of contributing to Wikidata vs the work that opencitations is doing?
    • I would say: Wikidata is broader in scope, e.g., papers, authors, organizations, genes, etc.
      • indeed, plus it has a (larger) community

Scribe: leveraging Wikidata for language learning edit

✨Number of participants: 19 (18:03), 18 (18:46), 18 (18:49)

🔗Slides and useful links: https://commons.wikimedia.org/wiki/File:Scribe_Data_Reuse_Days_Slides.pdf

🖊️ Notes:

Thanks for coming!

💬 Questions & answers:

  • AFAIK you're one of the first people using Wikidata's lexicographical data for a "professional" app. What was your experience with using the data?
    • so far it's fine. Nice experience to be able to get the data so quickly
    • taxonomy is inconsistent -> need for longer, more complex queries (which sometimes break in Python; see the sketch after this list)
  • Is there anything that could be improved on Wikidata from your perspective? (technical aspects, ontology, data quality...)
    • little things can be added later on... mark curse words especially (to filter them out in the app)
  • Have you been editing Wikidata yourself to add some data in? If so, did you do it with the web interface, or using some upload tool? How was this experience for you?
    • web interface and really nice experience
  • How do you imagine working with the Wikidata community in the future? What could we do to support you?
    • spreading the word, try to get Scribe users to Wikidata, edits must be made on Wikidata, tools to figure out where data may be missing.
  • The screen limits for the keyboard are untouchable below the spacebar, right? What are differences for Android / iOS for usable screen areas/dialogs?
    • for the iOS app, only have the area of the keyboard to work with. Popups of the keys cannot go above the keyboard. not sure how Android version will look like right now.
  • Do you plan to monetise Scribe? this is such a cool product.
    • how to do it in a moral way? no need for it right now. collecting people's data and using that for monetisation is morally not right for me.
  • Android pretty please?!
    • +2!!! (coming soon!!)
    • Android in Kotlin; desktop in Python
      • +1 for desktop as well
    • Android is the holy grail... so many language learners are on low cost phones and Windows XP / 7
  • How does Scribe handle diacritics during conjugation for some non-Latin scripts?
    • e.g. Swedish (2by2)
    • Russian switches from 3x2 to 2x2
    • Verb conjugation reacts to the language
  • Will scribe include dialects in future?
    • yeah! sounds great! have to look at it from the Wikidata side of things
  • While defining what are the most important "hints" for each language, did you form an idea of the ideal ontology of a word (for example, what kind of information a German verb Lexeme should include in order to be useful in the app)?
    • Sorry I was kind of tired at the end there and wasn't sure on this :D
    • Big thing was that I was reacting to what was available on Wikidata
    • Most verbs only have present and two past forms on Wikidata, so those were the ones that I chose to include
    • More will be added later for sure
    • Big thing is that I'd love to add any information in that can be presented to users in an intuitive manner
  • How did you choose the first set of languages that you implemented in the app? (was it because you know them better, because the data on Wikidata is more complete...?)
    • I started with languages that I speak
    • Mandarin would be great to add (but there are not many Mandarin lexemes on Wikidata)
    • Mandarin for second language speakers is usually written with Pinyin and then the character is selected
    • The issue is that a second language learner doesn't necessarily know what the character they want is
    • Adding in a Wikidata-powered long hold of a character that tells the user what the translation of the character is would really help a lot of people
    • Ex: "ma" is "mom" and "horse", but then if you don't know the characters well you don't know which to select
    • Again would need translations in Wikidata
  • How about typography rules, do you plan to include them as well? (e.g. is there a space character before the exclamation mark or not, depending on the language)
    • Basic typography is in there already
    • This is something where we'd need to maybe hard code some rules, or also potentially adding in some features later that would make suggestions based with an internet connection + machine learning over the text that the user is typing
  • For suggestion of cases/conjugations: how do you do it when you need more context than just the previous word? (eg "in" in German can be followed by Dativ or Akkusativ depending on the context)
    • Andrew: This is a big thing. As of right now, maybe it's not as helpful for those cases. I haven't really gone very much into the keyboard reacting to Dutch in an accusative case, aside from just telling the user: "You should be using dative (case)", "You should be using accusative (case)" or "You should be using genitive (case)". But this is another thing that would be really great: you are typing in dative right now, so let's make sure that the conjugations or whatever is being shown to the user reflect the case that you are typing in.
    • Definitely a goal of the app is to have the commands react to the case that the user is typing in
    • Sometimes this data needs to be added to Wikidata
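
A minimal sketch (an assumption, not Scribe's actual code) of the kind of lexicographical query mentioned above, sent from Python; Q188 is German, Q24905 is verb, and the User-Agent string is a placeholder:

    # Minimal sketch (assumption): fetch German verb forms and their grammatical features from WDQS.
    import requests

    QUERY = """
    SELECT ?lexeme ?lemma ?formRep ?featureLabel WHERE {
      ?lexeme dct:language wd:Q188 ;               # German
              wikibase:lexicalCategory wd:Q24905 ; # verb
              wikibase:lemma ?lemma ;
              ontolex:lexicalForm ?form .
      ?form ontolex:representation ?formRep ;
            wikibase:grammaticalFeature ?feature .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 50
    """

    rows = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "ScribeExample/0.1 (hypothetical)"},
    ).json()["results"]["bindings"]

    for row in rows:
        print(row["lemma"]["value"], row["formRep"]["value"], row["featureLabel"]["value"])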

Data Reuse Clinic edit

✨Number of participants: 24 (17:05), 20 (17:28)

Questions & answers & discussion:

  • When will it be possible to visualise Wikidata query endpoints on Wikimedia sites like maps, graphs etc.?
    • not possible right now. Let's find that ticket and work on it during Bug Triage Hour session next week.
  • is there a way to change the property of an existing statement while keeping all values/qualifiers/references? my use-case is "upgrading" generic P296 statements to more specific ones like P4803 for example
    • that is definitely not possible by default. Not sure if someone has developed a gadget for it.
    • should be possible via the API way
    • someone can also volunteer to write a gadget if preferred (during Pink Pony session on 24 March)
  • Given the value of a string that is an identifier of an item (e.g. an ISBN), what's the fastest way to jump to that Wikidata item? A SPARQL query, Elasticsearch? (see the first sketch after this list)
  • what's the best way to get around 414 Request-URI Too Large errors? Some of the queries by Scribe are very large (conjugations of verbs, in particular), and they work via the query service, but then when I try them from WikidataIntegrator I get the above error (trying to have the language update on Scribe app)
  • In one of the talks you mentioned using a correct User-Agent; is there any page where I can find those guidelines?
  • What were the things/ difficulties you stumbled upon while building your tools/ projects on top of Wikidata?
    • when I want to make things that make edits, figuring out the right structure to send to the api is a pain
    • especially wrt lexeme json (why? what makes lexeme special?) well, for a while you couldn't do everything with lexeme json that you might do with item json
    • you have to tell it the senses and forms you want to add are new, or it doesn't add them (the "add": "" thing — see the second sketch after this list)
      • not the case for new statements -> inconsistent
    • mh, so my takeaway would be that it is not obvious that forms and senses are more like items/entities than like statements. Maybe we could/should present senses/forms differently?
    • I think having better docs about how to make edits via the api would probably be enough. if there's something saying "here's how you do that", you don't have to try and figure it out yourself and get frustrated that it's not adding your forms/senses because you haven't magically realised you need "add"
    • I realized in the discussions that a lot of people are not using Wikidata's ontology and instead build and maintain their own taxonomy. I wonder what level of consistency and stability we would need for them to use our ontology.
      • I stay away from wikidata ontology as far as possible : too inconsistent and a few people you don't want to argue with +1
  • Should we have such a regular place/ time for discussions like this?
    • Within the business communities and other partners, a lot of the time the questions are "where can I get an expert / plug into the community?", because they don't want to spend too much time reading documentation. They are typically pointed to the mailing list, but sometimes they want phone calls. There's also Telegram where they can reach out. Feedback: pretty good; they are able to get their answers from the community support. There are various domains and interest areas; everyone has a different interest area. Hence project pages are good for now.
    • I like having meetings/events like this or chatting on telegram, although I'm mostly helping not asking for help
    • I'm open to chats about Wikidata ontological matters
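
A minimal sketch (an assumption, not an answer given in the session) of two common ways to jump from an identifier string to an item, using ISBN-13 (P212) as the example; the ISBN value and User-Agent are placeholders, and whether the search keyword matches depends on how the identifier is formatted in the data:

    # Minimal sketch (assumption): find the item for a known ISBN-13.
    import requests

    HEADERS = {"User-Agent": "IsbnLookupExample/0.1 (hypothetical)"}
    isbn = "978-0-262-03384-8"  # placeholder value

    # 1) CirrusSearch with the haswbstatement keyword (fast for exact identifier matches).
    search = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "query", "list": "search", "format": "json",
            "srsearch": f"haswbstatement:P212={isbn}",  # P212 = ISBN-13
        },
        headers=HEADERS,
    ).json()
    print([hit["title"] for hit in search["query"]["search"]])

    # 2) SPARQL on WDQS (useful if you also want related data in the same query).
    query = 'SELECT ?item WHERE { ?item wdt:P212 "%s" }' % isbn
    result = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": query, "format": "json"},
        headers=HEADERS,
    ).json()
    print([b["item"]["value"] for b in result["results"]["bindings"]])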

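A minimal sketch of the "add": "" marker discussed above when creating a new form on an existing lexeme via wbeditentity; the representation and lexeme ID are placeholders, and the actual write (login, CSRF token) is left out:

    # Minimal sketch (assumption): the "data" JSON for wbeditentity when adding a NEW form
    # to an existing lexeme. Without "add": "" the API does not create the form.
    import json

    new_form_payload = {
        "forms": [
            {
                "add": "",  # marks this form as new; omitting it is the pitfall discussed above
                "representations": {
                    "en": {"language": "en", "value": "placeholder-form"}  # hypothetical value
                },
                "grammaticalFeatures": [],  # fill in grammatical feature Q-ids as needed
                "claims": [],
            }
        ]
    }

    # This string would be sent as the "data" parameter of
    # action=wbeditentity&id=L<lexeme-id>, together with a CSRF token.
    print(json.dumps(new_form_payload))
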
The EU Knowledge Graph and WikibaseSync, how the European Commission is reusing Wikidata data edit

✨Number of participants: 21 (14:25)

🔗Slides and useful links:

🖊️ Notes:

  • Intro to Wikibase. Software running behind Wikidata. Can set up locally.
  • EU KG: data repository: knowledgegraph.eu
    • Can be queried
    • Can be edited by humans and bots
    • Scales well, multilingual, full track of changes
  • Structured data
    • from Wikidata, check which entities can be reused to model the knowledge
    • properties, e.g. address, opening hours, occupant... and import them locally into the EU KG.
  • Not only import property name, but also property constraint
  • keep local identifiers to link them to other sources
  • Imports are done using the Wikibase APIs, via Pywikibot
  • currently we are reusing more than 100k concepts from Wikidata in the EU KG.
  • the data is easily understandable, aligned with existing concepts and queryable
  • Keeping the data fresh: entities evolved over time e.g. info of countries' heads of states. Wikidata community is there to keep the knowledge fresh and up to date
  • Data kept fresh by the Wikidata community is thereby also kept fresh in the local instance
  • 1) When there is a refresh, we do not want local knowledge to be lost: priority is given to the local knowledge
  • 2) Statements that have been deleted in the local Wikibase are not re-imported from Wikidata
  • A bot, the WikidataUpdater bot, checks every 5 minutes for changes.
  • How: WikibaseSync
    • similar to WikibaseImport, but it is a Pywikibot script, so you do not need access to the machine! (see the sketch after this list)
  • Current content:
    • EU institutions, countries, capital cities, buildings, heads of state, DGs, etc.
    • 1,164,619 Projects co-financed by the EU
    • By reusing WD data, we discover so much more information regarding the beneficiaries of EU co-financed projects.
  • Linked Data Solutions: linked data solutions used internally by EC
  • Why EC doing this? (Kohesio)
    • The EU is financing cohesion policy, supporting tens of thousands of projects across Europe.
    • Generally done in Excel sheets.
    • Kohesio is to bring all these together in a transparent manner
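
A minimal sketch (an assumption, not the actual WikibaseSync code) of reading an entity from Wikidata with Pywikibot, the library mentioned above; mirroring it into a local Wikibase would additionally require that Wikibase to be configured as a Pywikibot site:

    # Minimal sketch (assumption): fetch an item from Wikidata with Pywikibot,
    # as a starting point for syncing it into a local Wikibase.
    import pywikibot

    wikidata = pywikibot.Site("wikidata", "wikidata")
    repo = wikidata.data_repository()

    item = pywikibot.ItemPage(repo, "Q458")  # Q458 = European Union
    data = item.get()  # labels, descriptions, aliases, claims, sitelinks

    print(data["labels"].get("en"))
    print(len(data["claims"]), "properties with claims")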


💬 Questions & answers:

  • I am curious about WikiBase if you don't mind, does it require a Triple Store or what were you using?
    • WB is like WD, you get the triple store, UI to edit it, a database
  • Does the EU knowledge graph give back to Wikidata by uploading some data to Wikidata? (Do you have content/knowledge that would make sense giving back to Wikidata? Are you giving back content already?)
    • not the case currently. Will be important at some point. Most data are Eurostat data... not actively ingesting.
    • Adding to that question: What would help you to give back?
    • Lydia:
      • But they are giving back with great promo like this 😉
      • And feedback for the development of Wikibase and tools around it
  • Do you use a technique/method to avoid vandalism in Wikidata be inserted in the EU knowledge graph?
    • basically, we can change the knowledge locally, and once changed locally, this knowledge cannot be overwritten by the sync
    • main entities (project financed by EU) not in Wikidata
    • We generally trust WD data; it has high quality.
  • Are the entities (e.g. offices) created directly in the EU knowledge graph, also created in Wikidata?
    • No. Wikidata should not be a repository for everything.
    • The main reason to set up a Wikibase is to be able to ingest data that should not be in Wikidata.

Building a simple web app using Wikidata data edit

✨ Number of participants: 31 (20:05), 29 (20:19), 23 (20:38)

🔗Slides and useful links:

🖊️ Notes:

💬 Questions & answers:

Best practices for reusing Wikidata's data edit

✨ Number of participants: 35 (19:01), 40 (20:08)

🖊️ Notes:

  • Social side of reusing Wikidata's data
  • Wikidata is a commons. We should give something back to Wikidata to ensure it stays around for a long time.
  • Doing right by users by getting them the best data they can.
  • Protect your reputation in case of vandalism.
  • Want to ensure that Wikidata is healthy and growing.
  • What does it mean to be a good citizen?
    • 1) An amazing community that collects, maintains and governs the data
    • 2) Useful and high-quality data for anyone to use
    • 3) Impactful applications built on top of our data to attract new people to the community to take care of the data.
  • Concrete steps on how to be a good citizen, aimed at people who build an application on top of Wikidata
    • 1) Give something back to Wikidata
      • Attention and publicity. Make sure more people know about Wikidata
      • Data improvements from internal quality assurance processes etc
      • Help out on maintenance work. keep an eye on changes to the data
      • Provide expertise
      • Give feedback about what's right or not. where can we improve?
      • Money to support the development and programmatic work
    • 2) Indicate where the data in your app is coming from. To make sure that your users know that your data is coming from Wikidata, so they have a chance to come back to improve the data.
    • 3) Let us know about the errors you find.
      • Small scale: bring up on-wiki (project chat/ wiki project)
      • Large scale: publish regular reports, contribute mismatches to Mismatch Finder
    • 4) Fix those errors you find
      • Wikidata is a wiki. You are encouraged to edit. Uncertainties can be discussed on Property talk page.
    • 5) Introduce yourself and your work on your user page.
      • Disclose if you are paid to edit Wikidata (required by Terms of Use)
      • Let others know who you are and what you do
      • Be honest and upfront about your motives and why you are contributing to Wikidata
    • 6) Try to fix issues upstream instead of working around them.
      • See if they can be fixed in Wikidata directly. That would benefit more people. It should be the default.
    • 7) Keep an eye on changes to content that is relevant to you.
    • 8) Take part in discussions that will have an impact on your app.
      • Editors need to understand how their decisions impact the world of reusers. Your perspective is vital in these discussions.
    • 9) Let the editors know what you are doing so they can take it into account
    • 10) Set up your own infrastructure if you are working at scale
      • Large commercial reusers: encouraged to ingest dumps or recent changes feeds to keep their own copy of the data and work on that
      • SPARQL query service is primarily intended to be used by small and medium-size projects or editors (big re-users can implement their own SPARQL server)
    • If you’re unsure: Let’s talk! We are happy to help. :)


💬 Questions & answers:

  • Do we already have community consensus that reusers outside of Wikimedia can list themselves on property talk pages in the "This property is being used by: ... Please notify projects that use this property before big changes (renaming, deletion, merge with another property, etc.)" box?

How to retrieve Wikidata’s data? edit

✨ Number of participants: 41 (17:09), 44 (17:18), 47 (17:33)

🔗 Slides and useful links:

🖊️ Notes:

  • Session is about how to get to Wikidata's data if you're building your own application
  • Some basics
    • A lot of data on Wikidata
    • Many ways to get that data
    • Depending on your intentions, there are better ways to do it than others
  • Some general pointers / best practices
    • follow user agent policy
    • follow the robot policy
    • 429 response - too many requests - time to reduce the amount of requests you are making
  • Access points
    • WDQS
      • API available at query.wikidata.org/sparql
      • Query results visualizations can be embedded in other websites
      • Worth using when you don't know the entities you want, but you know their characteristics
      • Don't use when..
        • FILTER(REGEX(...)) is an antipattern
        • You have millions of users. Recommendation: Run your own instance
        • You expect the results to be a large percentage of Wikidata's entities
      • If your query times out, help in https://www.wikidata.org/wiki/Wikidata:Request_a_query
      • Tip: add timeout to make the query time out earlier
      • Example tools using the WDQS: scholia
    • Linked data fragments
      • A bit more experimental, less supported by WMDE than the WDQS
      • Worth using when you're looking for a list of entities based on triple patterns, when your result is likely to be larger, and you're okay with doing --
      • Don't use when [missed this]
    • Linked data interface
      • Available formats: .json, .rdf, .ttl, .nt or .jsonld
      • Use when you want data on a smallish set of entities (esp. RDF), you already know the IDs of the entities you are interested in, and you want each whole entity (see the sketch after this list)
      • Don't use when you don't know exactly which entities you want or you want large amounts of data
      • Recommended: URLs without ?revision always return the latest data; certain URLs for a specific revision and format are likely to be cached already, such as
      • Example tools using linked data interface: OpenAlex, open source catalog of scholarly articles (url: https://openalex.org/ ) (source code at https://github.com/ourresearch/openalex-guts)
    • Search
      • running elastic search
      • Use when you're searching for a specific text string, you know the name of the entities you're looking for, or you can filter your search based on some simple relations within the data
      • Don't use when your search involves complex relations within the data
    • Action API
      • MediaWiki's own API, can be explored at Special:ApiSandbox
      • Use when you need to edit Wikidata, you need JSON data of a batch of entities
      • Don't use when you want large sections of all entities (use a dump instead), or when you just want to retrieve the current state of entities in JSON
      • Recommended: use the maxlag parameter if you're making edits, and keep in mind the other recommendations mentioned in API:Etiquette
    • Dumps
      • Various available formats: JSON, RDF (all and truthy), XML
      • Mirrors available as well
      • Use dumps when you need data on a significant proportion of entities, or if you want to set up your own query service using Wikidata's data
      • Don't use them when you are restricted with regard to bandwidth, storage space, etc., or when you need very current data, as the dumps are only updated once a week
      • Recommended: do not use the MediaWiki XML dumps as these are not considered stable; you can use WDumper to get partial custom RDF dumps
      • Example user of dumps: wikitrivia.tomjwatson.com
    • Recent changes stream
      • Great way to keep an eye on what has changed on Wikidata
      • Stream.wikimedia.org, per-wiki feeds in the action API, legacy streams on IRC
      • Use when you need to react to changes in real time, or you want to keep up with everything happening on Wikidata
      • Returns data for all wikis; filter the stream on your end if you only want Wikidata changes.
      • Example project using recent changes stream: listen to wikipedia ( visit to http://listen.hatnote.com/ ), source code at https://github.com/hatnote/listen-to-wikipedia
  • Other things to bear in mind
    • Query builder -- lets you build the queries in a way that may be more intuitive to a non-SPARQL expert
    • Toolkits that help others -- knowledge graph toolkit, wikidata toolkit
  • What's coming up
    • Together with the Wikimedia Foundation we are continuing to scale the Wikidata query service to deal with all the requests
    • We are also building a REST API and working on Wikidata data access documentation
  • For more info..
    • Documentation can be found at Wikidata:Data access, which we will be updating with the information from today
    • If you need help with SPARQL queries, you can request a query on the Wikidata:Request a query page
    • Other general places to get help: the Wikidata mailing list, project chat, and the Telegram channel
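
A minimal sketch tying together two of the points above: fetching one whole entity you already know via the linked data interface, with a descriptive User-Agent; the entity ID and User-Agent string are placeholders:

    # Minimal sketch (assumption): fetch one whole entity via the linked data interface.
    import requests

    HEADERS = {"User-Agent": "MyWikidataApp/0.1 (https://example.org; contact@example.org)"}  # placeholder

    # Without ?revision this always returns the latest data; .json, .ttl, .rdf, .nt and .jsonld all work.
    url = "https://www.wikidata.org/wiki/Special:EntityData/Q42.json"
    entity = requests.get(url, headers=HEADERS).json()["entities"]["Q42"]

    print(entity["labels"]["en"]["value"])
    print(len(entity["claims"]), "properties with statements")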

💬 Questions & answers:

  • Is ElasticSearch what is used in Special:Search https://www.wikidata.org/w/index.php?title=Special:Search ? If so, what's the syntax for searching? I tried with this search => hasdescription:commons <=, but no result was shown and I got this warning "A warning has occurred while searching: hasdescription keyword contains unknown language code: commons "
  • Does hassitelink or similar exist?
  • Is there a way to get a cursor to a paginated result (e.g., next 500 rows) of a SPARQL bulk query?
    • There are no paginated results, so to speak, from SPARQL.
    • You can use OFFSET and LIMIT, but just using these will not guarantee that the list you are OFFSETting from is in the same order, thus you'll get unexpected results. You can ORDER your results, but this will likely result in timeouts, as the whole list needs processing (see the sketch after this list).
  • Can we convert API search result to csv ?
  • I'm a beginner in regards to LDF, what would be the advantages of getting the data in that format?
    • You'd use it if your query is comparatively simple, has a larger result set and the query service times out for example.
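
A minimal sketch (an assumption, not from the session) of the ORDER BY + LIMIT/OFFSET approach described above; the stable ordering is what keeps the pages consistent, and deep offsets can still be slow or time out:

    # Minimal sketch (assumption): page through WDQS results with a stable ORDER BY.
    import requests

    HEADERS = {"User-Agent": "PagingExample/0.1 (hypothetical)"}
    PAGE_SIZE = 500

    TEMPLATE = """
    SELECT ?item WHERE {
      ?item wdt:P31 wd:Q23397 .   # instance of: lake
    }
    ORDER BY ?item                # stable order so pages don't overlap or skip
    LIMIT %d OFFSET %d
    """

    for page in range(3):  # first three pages only, for illustration
        query = TEMPLATE % (PAGE_SIZE, page * PAGE_SIZE)
        rows = requests.get(
            "https://query.wikidata.org/sparql",
            params={"query": query, "format": "json"},
            headers=HEADERS,
        ).json()["results"]["bindings"]
        print("page", page, "->", len(rows), "rows")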

Wikxhibit: easily build useful websites with data from Wikidata and across the web edit

✨ Number of participants: 30, 38 (16:14)

🔗 Slides and useful links:

  • wikxhibit.org

🖊️ Notes:

  • How to build cool websites from Wikidata data.
    • Wikxhibit allows building websites using Wikidata
    • Present, show a demo, Workshop, show your presentations
    • Goal of Wikxhibit: We want to empower anyone to create cool presentations of Wikidata without programming.
    • Uniform identifiers provided by Wikidata are used to fetch data from other websites
    • Using properties to query Wikidata in Wikxhibit
    • The values of these properties are actually items
    • Within the html document, able to read the properties, items and label

💬 Questions & answers:

  • Is Wikxhibit mainly for developing prototypes?
    • No, it depends on how fancy the application you want to create is (e.g. fetching data and doing fancy things with it). You can do caching, but Wikxhibit can't e.g. fetch data every week.
    • I'd say Wikxhibit is quite good for prototypes, as it provides a lot of basic functionality. It might not have the fanciest functionalities you envision in your tool, but it provides a lot of the basic interactivity you need for your prototypes
  • Will it show multiple values if the property has more than one?
    • Yes, you would add the "mv-multiple" attribute
  • Can it display maps?
    • Programmers can add plug-ins. Maps (visualizing coordinates from Wikidata)
    • Likely to be able to, but haven't tested it yet.
  • Can you sort multiple values using a qualifier like 'series ordinal' ?
    • Not supporting qualifier yet. Supporting the main value.
    • Property- value- qualifier. Working on this pretty soon.
  • First use case from the examples on the website: NLW Art ("National Library of Wales")
  • How did you get to the page where you could start editing the HTML for Wikxhibit?
    • Click "play" on website.
  • Is the set of properties limited or should any property work? I'm having trouble getting ... working.
  • I would feel more comfortable if property= also allowed property IDs.
  • <a property=fullWorkAvailableAtURL> worked but I had to guess to remove the spaces

Wikidata for Performing and Visual Arts edit

✨ Number of participants:  30, 35 (15:14), 35 (15:35), 37

🔗 Slides and useful links:

🖊️ Notes:
  • Footlight, by Gregory Saumier-Finch
    • Footlight uses the Reconciliation Service API defined here: https://www.w3.org/community/reconciliation/ (a minimal sketch follows after this block)
    • Crawls events on webpages 
    • LOD output includes Wikidata IDs, linked bi-directionally
    • Schema structured data output on event page includes Wikidata IDs for schema:Performer and schema:Organizer.
    • Footlight focuses on upcoming events 
    • console.artsdata.ca
    • Items in Wikidata link back to the data at artsdata.ca through the Wikidata property "Artsdata.ca ID"
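
A minimal sketch (an assumption, not Footlight's code) of a single query against a Wikidata reconciliation endpoint following the Reconciliation Service API linked above; the endpoint URL, query text and type are assumptions for illustration:

    # Minimal sketch (assumption): one reconciliation query against an assumed
    # public Wikidata reconciliation endpoint.
    import json
    import requests

    ENDPOINT = "https://wikidata.reconci.link/en/api"  # assumed endpoint

    queries = {
        "q0": {
            "query": "Romeo and Juliet",
            "type": "Q7725634",  # literary work (hypothetical type choice)
            "limit": 3,
        }
    }

    response = requests.get(ENDPOINT, params={"queries": json.dumps(queries)})
    for candidate in response.json()["q0"]["result"]:
        print(candidate["id"], candidate["name"], candidate.get("score"))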



  • Culture in Time,  by Gregory Saumier-Finch
    • Developed as part of the GLAM Hackathon in 2020/21
    • reusing Wikidata and Artsdata data
    • Culture in Time ver3 (not yet 'live'). 
    • Productions brought in from various data sources
    • Group productions by theme
    • Create an account, add your own SPARQL
    • Spotlight: a way to group together a collection of performing arts productions. Past: from Wikidata; Upcoming: data from Artsdata
    • Through a SPARQL query, it can find items whose title contains the visited item's title. For example, when someone visits "Romeo and Julia", other works with that title would be shown.



  • MarionNet.CA, by Jean-Robert Bisaillion
    • Presentation  https://docs.google.com/presentation/d/1W0-AUhXwMijwHzdo6pgIw2BtPi1Xc4NJs3tdCNgAA9I/edit?usp=sharing 
    • Final goal: find ways to merge puppeteer art with the European way.
    • They thought that using Wikidata could align their visions in Québec, Canada.
    • Data model is aligned with the Linked Digital Future Conceptual Model and the Wikidata ontology as documented in the WikiProject: Performing arts.
    • Initially did not have access to the PAM Lab European portal model, so they relied on Wikidata to enable reuse.
    • They used Wikimedia Commons to upload their puppet images. Through templates they have added the logo of the institution to the description of the images in Commons.
    • The data model includes four types of entities: the puppet artefact, legal entities, persons and puppet theatre shows.
    • SPARQL query to show images of puppets they have added to Commons and Wikidata (shown in the presentation): https://tiny.one/AQM01



  • OpenArtBrowser, by  Bernhard Humm
  • https://openartbrowser.org/en/
  • http://ceur-ws.org/Vol-2535/paper_2.pdf
  • https://twitter.com/openartbot
    • Fascinate people with art via Wikidata. Completely powered by Wikidata. Strolling through a virtual museum. Flow through the museum.
    • When visiting a painting, a description is shown. This text is retrieved from Wikipedia.
    • search function (semantics)
    • 7 dimensions for exploring artworks: artists, locations, motifs, materials, types, genres, etc.
    • Angular web application (Elasticsearch server)
    • Wish: SPARQL bulk queries with pagination (+ documentation) 

💬 Questions & answers:

  • Footlight, by Gregory Saumier-Finch
    • From Sara Tiefenbacher: You mentioned upcoming events/performances ... is there a workflow for past (1-20 years ;)) events/performances? Thanks for answering! Question was from the perspective of a researcher ;)
      • Answer from Caitlin: Past events are there for a period of time. As we develop this tool and Artsdata, we intend to create an 'archive' for past events, that can be accessible for certain use cases. We don't have events from 20 years ago as we started this project in the last few years - at this time, the primary goal of Footlight is to help arts organizations improve discoverability of their upcoming events (rather than 'research' of past events). That said, we hope to develop an archive section to serve the needs of those interested in the past. :)
  • Culture in Time,  by Gregory Saumier-Finch
    • Can you share what inspired you to create Culture in Time? Was there a problem you were trying to solve? 
      • Answer from Caitlin: we first developed Culture InTime at GlamHack 2020 with Beat Easterman with the goal to create "Cultural Calendar using existing LOD on past and future productions, venues, artists, and works. " Since that time, it has evolved to be more of a playground that we hope will support several different kinds of users (researchers, arts organizations, etc). We had another chance to develop it at GlamHack 2021 and again this year where Gregory added many of the features he showed you today. The problem we had was showing people a 'real world' use case of linked open data so they could better understand how to work with data and what you could do with it. (Hope that answers your question)
    • On Culture in Time: I like the idea of user-submitted SPARQL (if I correctly understand the concept) but how to deal with malicious queries like long running queries. Does it affect the system? (selecting ?s ?p ?o from all wikidata or something)
    • From Frédéric Julien to both Gregory and Jean-Robert: The key to unlocking the power of linked open data in the performing arts is to link event entities (i.e., performances) to production entities. At present, production information in Wikidata is rarely made available before the premiere. If production information was made available in Wikidata before the premiere, would we run the risk of Wikimedians considering these items as "promotional" or not meeting the notability requirements? (-> Community discussion needed)
      • Arts data as a complement to Wikidata
      • From Frédéric Julien: In my opinion, production information can be made available for reuse in Wikidata both before and after the performance event. I hope the community starts using it in this fashion. As a matter of fact, our next two Wikidata workshops will focus on production items: https://linkeddigitalfuture.ca/wikidata-workshops-season-2/
  • MarionNet.CA, by Jean-Robert Bisaillion 
    • What are some of the challenges you have encountered in this project?
      • We had to bring contributing companies to feed us with the proper data totally from scratch without a user friendly interface.
      • Our destination database (PMB Open Source Integrated Library Management System)  is not yet ready to batch import from Wikidata. 
      • to convince people to use Wikidata
      • building a friendly interface to populate the information in Wikidata
      • PMB and other tools are not ready for bulk import of information
  • OpenArtBrowser, by Bernhard Humm
    • From Frédéric Julien: What are the data source for the individual motifs (i.e. « barque ») and the tags? Was the ISA mobile tool used to identify motifs on paintings?
      • No. The data is curated in Wikidata.
    • When one of the paintings (Q152509) was visited in OpenArtBrowser, Youtube videos were also shown, the links to those videos are also retrieved from Wikidata? If not, how are those videos found?
      • No. We curate the YouTube videos ourselves.
    • Q - How could we reuse your timeline framework?  It doesn't seem to be clickable... 
    • A rather technical question: Given a Wikidata QID, is there any way I can open their corresponding page in OpenArtBrowser? For example, Q152509 should open https://openartbrowser.org/en/artwork/Q152509 and Q40089 should open https://openartbrowser.org/en/material/Q40089
    • How often is OpenArtBrowser synced with Wikidata?
      • Once in a week
    • From Caitlin: I love this application - to be able to 'stroll' through the artworks and discover similar or related works - a lovely online experience to expand our knowledge of what is out there. How do you choose which items to link and display? There must be so many. Is it keywords, algorithms? (Ranking based on a concept of importance.) Thanks again!
    • thank you
      • Data source: Wikidata


Weaviate for Wikidata & Wikipedia – is vector search a new way for the discovery of Wikidata? edit

✨ Number of participants:  25 (20:02), 21 (20:15), 18 (20:23), 18 (20:35)

🔗 Slides and useful links:

🖊️ Notes:
  • Weaviate is an open source vector search engine
  • Why do this? You can use a machine learning model to re-rank the results and ask vaguer questions. Not only text but also graph embeddings.
  • a new way of searching through and representing data.
  • Weaviate uses a GraphQL API.
  • Bob said that all queries shown in this presentation are shown in the documentation.
  • Weaviate can perform semantic search queries by using "nearText".
  • It can also perform question answering. Question: "What was the population of the Dutch city Utrecht in 2019?". The query to answer that question shown in the presentation was retrieved from https://github.com/semi-technologies/semantic-search-through-Wikipedia-with-Weaviate
  • The power of Weaviate is in its speed to answer queries. Queries can be answered in a few milliseconds.
  • We can specify paths for answering questions: InPath ["inArticle", "Article", "title"]. We are telling Weaviate to match the Wikipedia article of "Michael Brecker" and then try to answer the question "What was Michael Brecker's first saxophone?"
  • Weaviate introduces new types for searching and new ways for finding relationships in Wikidata. The method shown in this presentation only applies to text search, but vector embeddings can also be used for audio, images and videos.
  • Help people to better organise the data.

💬 Questions & answers:

  • What inspired you to create Weaviate? 
    • Problem: traditional keyword-based search is no longer sufficient.
    • Deep learning enables AI based search 
  • What are embeddings: https://en.wikipedia.org/wiki/Word_embedding
    • Basically, this is converting words to vectors, so rather than "dog" we get [1, 0, 0]
    • We can then use linear algebra to calculate similarities
    • Another example for word embeddings is calculations like the following:
      • We get embeddings for words like man, woman, king, etc
      • We can then do king - man + woman = ? (getting queen, very simplified — see the toy sketch after this list)
  • Does it support other languages than English? Like a WP page in another language.
    • Yes. We have seen French, Spanish etc. Able to import the whole Wikipedia/ Wikidata datasets. 
  • This is the first time I listen about embeddings. I understood that embeddings can be used to find relationships in text data, and you mentioned that this technique can be used with images and video, what are some questions that could be answered if this is done with images and videos?
    • In machine learning, the embedding is usually under the hood. 
    • Multi-modal. Can mix text and images after querying. Videos are tricky. Stills of videos are used as individual images. 
  • Just to get this right: 
    • The vector space is always the same (same number of dimensions), right?
      • Yes. 
    • What exactly is the Model trained on in the case of Wikidata?
      • Based on HuggingFace models, which are themselves trained on Wikipedia among other datasets.
      • Wikidata: Node distances (if I got that right)
    • Andrew: iOS app (a keyboard for second-language learners). Using Weaviate could be possible.
      • General idea for now:
        • Weaviate or something similar (maybe the ML behind it) could be used to quickly provide embeddings for words from Wikidata lexicographical data.
        • Machine learning implementations are needed for features like autocorrecting errors and autocompletion of the next word the user wants to type.
          • The sentence that the user is typing could function as the input for Weaviate or models it's based on, and then the next words the user might type would be the desired output.
        • Weaviate's support for multiple languages would be important as Scribe is specifically for non-English use cases (as of now).
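
A toy sketch of the king - man + woman analogy described above, with made-up 3-dimensional vectors purely for illustration (real models use hundreds of dimensions):

    # Toy sketch (made-up numbers): word-vector arithmetic with cosine similarity.
    import numpy as np

    embeddings = {            # hypothetical 3-d vectors, not from a real model
        "king":  np.array([0.8, 0.6, 0.1]),
        "man":   np.array([0.7, 0.1, 0.0]),
        "woman": np.array([0.1, 0.1, 0.7]),
        "queen": np.array([0.2, 0.6, 0.8]),
        "dog":   np.array([0.0, 0.9, 0.0]),
    }

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    target = embeddings["king"] - embeddings["man"] + embeddings["woman"]
    best = max((w for w in embeddings if w not in ("king", "man", "woman")),
               key=lambda w: cosine(target, embeddings[w]))
    print(best)  # with these toy vectors: "queen"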


Weaviate's public Slack channel: look for Bob. https://join.slack.com/t/weaviate/shared_invite/zt-goaoifjr-o8FuVz9b1HLzhlUfyfddhw


Open Food Facts and Wikidata – Structured food for thought edit

✨ Number of participants:  23 (19:22), 21 (19:24), 19 (19:43)

🔗 Slides and useful links:

🖊️ Notes:
  • OpenFoodFacts - "The Wikipedia of Food", for people to make better and healthier food choices. Created 10 years ago.
  • OpenFoodFacts is helping to tackle UN-SDG goals:
    • 3. Good health and well-being
    • 12. Responsible consumption and production
    • 13. Climate action
  • Food has impact on health, but also on CO2 emissions - but more and more consumers are having a positive impact on reducing emissions
  • When someone visits a supermarket and sees lots of products of the same type (e.g. cereals), one usually asks: how to choose the best pack? Open Food Facts tries to address this problem by providing people with information about food.
  • Product packaging contains nutrition information, but it's difficult to make a completely informed decision out of it if you don't rely on technology
    • Also: situation changes significantly from country to country
  • In the EU there are labels that tell you how nutritious and how processed a food is
  • empower users to have an impact on food health, environment and food system
  • mobile app that gives you info about nutrition score and processing score + collect info from your photos to improve the database 
  • Just as Wikidata, Open Food Facts stores open data so other applications can reuse the data
  • Statistics of OpenFoodFacts: 1.8 million products contributed by volunteers, 80,000 active contributors, 57 scientific articles linked to OpenFoodFacts. 182 countries with products.
  • OpenFoodFacts managed to obtain food data from France to improve the database.
  • How it works: User uploads photo or contributes data; data is OCR'd and processed; derived data is then matched with external databases; and finally shown on the website
  • Many data in a product: Name, quantity, brand, ingredients, origin of ingredients, nutritional facts, claims (halal, vegan...), packaging shape and material, info about recycling package (depending on country), where product has been created/processed/packaged...
  • According to Pierre, describing food is not straightforward, so Wikidata is used to build underlying hierarchies, translations (ways to use Wikidata 1)
  • Some level of control over how the taxonomy works, to prevent disruptions, especially on allergens (Wikidata not directly used but linked back to)
  • Link the taxonomy back to Wikidata. Also build the taxonomies with Wikidata.
  • 15,000 ingredients in OpenFoodFacts, and a larger taxonomy that helps with digesting and translating data
  • By using OpenFoodFacts, someone could eventually understand a food packaging in Japanese since the information could be translated to any language (thanks to Wikidata?).
  • Other projects: OpenProductFacts and OpenBeautyFacts - basically the same approach but for cosmetics and other products, leveraging on EU norms and UNSPSC codes
  • Ways to use Wikidata 2: Enrich experience to final users
  • Wikidata in a new app! 
  • He took apple as an example because he considers it a well-documented food.
  • Wikidata could be used to build a graph of companies and their ownerships.
  • There was a Menu Challenge in 2015 to improve and translate food items on Wikidata: https://www.wikidata.org/wiki/Wikidata:Menu_Challenge.
  • WikiProject Food is a WikiProject in Wikidata related to Open Food Facts: https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Food_and_drink



💬 Questions & answers:

  • I understood that photos of products can be uploaded to OpenFoodFacts. In Wikimedia Commons, photos of food packaging are sometimes deleted because the packaging contains copyrighted content (e.g. drawing of Mickey Mouse, Toy Story, etc.). Can such images be uploaded to Open Food Facts?
    • Yes! We have a Wikimedia Commons-compatible license. We've never had a single problem with users. You can also use the app to take photos.
  • Just as happens with Wikidata, can data at OpenFoodFacts be queried with SPARQL? If yes, is there any page where examples are shown? If not, are there alternatives to do data analysis on the data at OpenFoodFacts? 
    • No SPARQL endpoint yet, but dumps/exports are provided.
    • OpenFoodFacts graph generator (link?)
  • What could we do to make Wikidata even more useful for you?
    • Tools are already convenient.
    • Tiny Wikipedia issue: we would like to be able to directly link to sections of a Wikipedia article about chemical substances (e.g. "use in food"), e.g. by naming the sections consistently 
  • How is the data in OpenFoodFact and sister projects different from other databases in online food/ retail companies?  
    • Its Open! We are not the first food database to exist. But we are the first and only large scale Open database. 
    • We are even more thorough than the producers themselves. We actually have more info
  • Self-reflection as a Wikidatian: Do we have any examples of well-modeled connections between brand names and food items? (today most Wikidata items are conflating both concepts in the same item)
    • Coca-Cola (product) vs Coca-Cola (company). Barcodes are not like QIDs; they can refer to different things. 
  • [kind of answered now ^^] When a product changes its ingredients, is it sold with a new EAN or do they reuse the old one? No. Depends on the brand. In that case, do you just update the outdated data or do you have a way to distinguish the different versions of the product?
  • If someone wanted to use a customized version of the Nutriscore (ex: considering that saturated fat is actually fine, while simple sugar/high glycemic index food should be banned), could that be done?
    • Yes. The raw data is also available, and you can build any algorithm to process it.
  • Do you see attempts to manipulate the data in Open Food Facts?
    • No. Little to No vandalism. Mostly children taking selfies :-) 
  • Does Open Food Facts have hackathons? Can we join? :D
  • Related to the question about photos with copyrighted content (Mickey Mouse, Toy Story): Suppose a big company complains about images of their products being uploaded to the site due to copyright claims, would the images be deleted?
    • Never had serious issues with that. 
  • Pierre can be reached at pierre AT openfoodfacts AT org


Lightning talks edit

✨ Number of participants:  55 (as of 17:25), 60 (as of 17:45)

🖊️🔗 Notes, slides and useful links:



💬 Questions & answers:

  • Wikidata Hub (Maxlath)
  • Auto-updating Google Sheets spreadsheets from Wikidata queries (Navino Evans)
  • KGTK Wikidata browser (Amandeep Singh and Gleb Satyukov)
  • Cite Q (Andy Mabbett)
    • Is there some strategy for sister projects that reuse the template to stay in sync with the developments on enwiki?
      • (Mike Peel): It's a relatively stable template, so there aren't many updates to keep in sync. Ideally all uses should be linked to the Wikidata item for the template (https://www.wikidata.org/wiki/Q33429959), so that updates can be notified where it's used!
        • If we copied it in March 2020 to svwiki, are we in sync then?
          • No, since we did a big rewrite at the end of 2020 and start of 2021! (MP) (This is why global templates would be really useful!)
  • Flaneur-App (Erik Freydank)
  • Hewell (Brian Shrader)
    • Any plan for Android?
      • possibly in future
    • The name/description/etc. doesn't seem to mention Wikidata - could this be more prominent?
      • no reason why not
    • Integration with lib.reviews? (instead of the apple reviews)
      • We've been using Apple CloudKit all this while; just got to know about lib.reviews
    • How are the contributors on WP and Commons credited, other than just mentioning CC BY-SA?
      • each page in Hewell links out to the Wikidata page (attribution). 
  • Spontaneous LT here?


Opening session edit

✨ Number of participants:  50 (as of 17:10)

🔗 Slides and useful links:

💬 Questions & answers: