Wikidata:ExtractTransformLoad/Analysis

Executive summary

The goal of our project is to enhance the data transformation tool LinkedPipes ETL (LP-ETL) in a way that allows technical savvy volunteers willing to bulk load (a.k.a. mass import) data from authoritative sources on the Web to Wikibase instances and, in particular, to Wikidata to do so with a higher degree of automation and in a more repeatable and sustainable way than before.

In this document we overview the Wikibase architecture and data flow, analyze the process of bulk loading data to Wikibase and the pitfalls we encountered. Then we specify new functionality that needs to be implemented in LP-ETL to help with the process. This analysis document concludes the first part of our project - WP1 - Analysis and design.

Analysis

During the process of analyzing the possibilities of editing Wikibases and Wikidata in a similar way to how LinkedPipes ETL is currently used on the Web, we struggled with imprecise and incomplete documentation of the Wikibase API. This issue was also admitted by the Wikibase API authors and Wikidata contributors we met at the Wikimedia Hackathon 2019 in Prague. Therefore, we describe the encountered issues and create an overview of what we are dealing with in this section.

Wikibase data structure - items, properties, statements

Figure 1 - Wikibase Item structure

Wikibase items (see below) are explained in detail for instance on the Wikidata wikipage or in the Wikibase documentation. To simplify, data in Wikidata is structured as Items representing data entities, and Statements, consisting of properties and values saying something about the Item (see Figure 1).

Wikibase data flow and architecture

Figure 2 - Data flow in MediaWiki with Wikibase and Query Service extensions

The whole architecture of a queryable Wikibase such as Wikidata and the data flow within it are more complex than it would seem at first glance, as is illustrated in Figure 2. There is a recent blogpost by Wikidata tech lead Addshore detailing the technical parts, we will focus more on the overview and the context of bulk data loading. In this section, we are going to describe what is happening from the moment there is some source data on the web to the end, where data is updated in Wikibase and queryable via the Query service.

The red parts represent sources of data on the web, and their transformation to Linked Data using LP-ETL, i.e. what is happening on the Web now. The green parts represent the manual editing process by contributors using the Wikibase Item editing form, which uses the Wikibase JSON-based API to store the data about items as wiki pages in MediaWiki with the Wikibase extension. One notable disadvantage of storing data like this is the lack of ability to query the data other than full-text search on item labels.

To be able to query the items properly, e.g. based on values of some statements, we need a proper database. That is the Query service extension, which essentially is a standard Blazegraph RDF database providing a SPARQL endpoint for querying, as we know it from other Linked Data sources, represented by the white parts. The true magic happens in the blue parts, which represent the process of transformation of items stored as wiki pages to items stored as RDF data. Finally, what we focus on in this project, is the yellow part, i.e. using LP-ETL to load Linked Data to a Wikibase instance.

Let us now focus on individual parts of the architecture and data flow in more detail.

Preprocessing data and loading it into Wikibase

The first step in every data transformation is getting the source data in a format, which can be worked with using a given tool. This is the part where the existing implementation of LP-ETL helps, as it is already used in the wild to transform data from various data sources to Linked Data, i.e. data in RDF, using well-known vocabularies, linked to other data. Therefore, it contains a library of reusable components for various occasions, and there is a set of tutorials and guides showing users how to work with the tool. Therefore, this part is not in the scope of our project, except that we will process some external data for the proof-of-concept of our approach.

When faced with the task of getting this data to Wikibase or Wikidata, the contributors have several options of how to approach it.

The manual way

Figure 3 - Item edit form in Wikidata

The first option is that they manually create and update the data in the Wikibase instance based on the content of the source data using the Wikibase Item edit form, as can be seen in Figure 3.

Using existing tools

For larger datasets, the manual approach obviously does not scale well. Here comes the second option, which involves preprocessing of the data manually and loading it into available tools which can already handle bulk loading data into Wikidata, such as QuickStatements and OpenRefine, represented by the "Various tools" box in the diagram in Figure 2. This is still a manual process. The only automated part is the loading of prepared statements into a Wikibase instance. Therefore, when the source data changes, the contributor needs to do the whole process again. The data preprocessing is done either manually, or using ad-hoc scripts, which typically only the author understands. This makes this approach hard to maintain.

Using LinkedPipes ETL

The third option is to use LP-ETL for the entire process. This would mean to first transform the source data into Linked Data which could even be published outside of Wikidata and then to map it and load it into a Wikibase instance. LP-ETL is already in use on the Web to publish Linked Data, which is represented by the red parts in the diagram in Figure 2. The loading of data into a Wikibase instance is what LP-ETL currently does not support. This is the yellow part in the diagram and the core of this project. The LP-ETL pipelines can then be scheduled to be run at regular intervals. In addition, individual transformation pipelines for individual data sources look similar, making them maintainable by multiple contributors, and they use the same set of reusable components instead of ad-hoc transformation scripts in various programming languages.

Transformation of wikipages to RDF data

When data (Items) is loaded into a Wikibase instance, it exists only in the form of structured wikipages corresponding to the individual items. Most of all, this means that the data can only be queried in the same way as any other wiki, i.e. using a full-text search on top of Item labels, aliases and descriptions, which is not enough. In particular, we cannot query like this for all items of a certain type which are stored in the Wikibase, which is what we need to be able to determine which items from the source dataset are already present in the Wikibase and which need to be created. For this, we need the Query service, represented by the blue and white parts in Figure 2.

The blue parts in Figure 2 represent two data transformation scripts. The Dumper script reads the whole Wikibase and produces an RDF dump using the Wikibase RDF data model. This dump (or the published Wikidata dump) is then loaded to Blazegraph to initialize the database. The Updater script then runs in regular intervals, monitoring the “Recent changes” API of the Wikibase. When an item is changed, the script requests its RDF representation (e.g. here) and updates Blazegraph accordingly.

The white parts of Figure 2 represent what we already know from the world of Linked Data, and that is the Blazegraph RDF store publishing a SPARQL endpoint, which can be used to query the RDF representation of Wikibase data, enabling us, for instance, to determine which items from the source dataset are already present in the Wikibase and which need to be created, which we cannot do with the Wikibase alone. This is also used by the Wikidata Query Service UI and other applications.

Wikibase RDF data model

There is the detailed RDF dump format documentation in the Wikibase docs. The knowledge of the RDF format is a prerequisite for contributors to be able to use the outcome of our project. The RDF data model is used in the Query service when querying using SPARQL. We also plan on using the same model as the input format for the newly created LP-ETL components handling the loading of such data to Wikibase to shield the users from the Wikibase JSON-based API which, as confirmed by the API developers we met at the 2019 Wikimedia Hackathon in Prague, is unfriendly to use. Some of the issues of the API are detailed in the next section.

During the analysis of the RDF data model and its usage in Wikidata and Wikibase, we have also identified the following issues, which will need to be taken into account when using our solution, especially for contributors already familiar with RDF and SPARQL in the decentralized Linked Data environment.

Issue 1: Difference in Wikibase vocabulary prefix

Older Wikibase instances based on the currently still stable MediaWiki 1.32 use this prefix: @prefix wikibase: <http://wikiba.se/ontology-beta#> . while Wikidata and newer Wikibases use a no longer -beta one: @prefix wikibase: <http://wikiba.se/ontology#> .

Therefore, when a contributor decides to query another Wikibase instance, they need to change this in their query. The same goes for the loading part. This issue may resolve itself over time, when the older MediaWiki instances will be updated.

Issue 2: Different IDs of items and properties

The second issue is inherent to how Wikibases and Wikidata work and we do not expect that this behaviour will change. Nevertheless, it is something worth noting. Every Wikibase instance manages its own IDs of items and properties. Therefore, when loading the same data to two different Wikibase instances, not only the same items, but also the same properties will have different IDs in each of the instances. This means that when a contributor switches loading of the data from one Wikibase instance to another, they have to rewrite not only the prefixes, but also the IDs of all properties used. In addition, the IRIs used in the RDF representation of the data of a particular Wikibase are based on the URL of that particular Wikibase. For example, when the Wikibase URL is https://wikibase.opendata.cz, the URL of Item with ID Q2079 is https://wikibase.opendata.cz/entity/Q2079.

Issue 3: HTTP/HTTPS difference in Wikidata base URL and Wikidata Items IRI prefix

HTTPS has become a standard on the Web today and the Wikidata URL is https://www.wikidata.org. However, the IRIs used in the RDF representation of Wikidata use the HTTP URL scheme. Therefore, when working with a Query service or loading data using our new component, it is not possible to derive the Item IRIs based on the URL of the Wikibase instance. This means that there will have to be an additional parameter specifying the base of the IRIs used in the data.

Wikibase JSON data model and JSON-based API

Bulk loading data into Wikibase instances is possible only via the Wikibase JSON-based API. The documentation also links to a more up to date, version of the API documentation in Wikidata, which also includes examples. While we got to know the JSON data model used by the JSON API thanks to this detailed description of the data model in JSON, this will not be required by the users of our solution. We hope to shield them from its complexity by using the Wikidata Toolkit library as the base of our newly created LP-ETL component and the Wikibase RDF dump format as the format for the input data.

We tried to use the JSON-based API with what we already had in LinkedPipes ETL. From this experiment, the following pipeline was created.

Figure 4: LinkedPipes ETL pipeline showing the complexity of JSON-based API

The pipeline is obviously quite complex. The whole upper half of the pipeline deals only with logging into the API and handling the set of resulting tokens and the resulting Cookie. The bottom half is responsible for creating missing items and updating existing ones and is also more complex than necessary, but this is due to the API being JSON based and LinkedPipes ETL being RDF oriented. Therefore, all API responses needed to be converted to RDF and all API calls had to be parametrized by JSON documents, which had to be generated.

In any case, getting the source data and transforming it to Linked Open Data is handled by the 6 lower left components, and that number is quite high only because the source data was only available upon HTML form submission, which had to be simulated.

Issue 1: API complexity - Cookies, Tokens, CSRF token expiration

A typical Web API tries to get as close to a RESTful API as possible. It typically consists of a set of endpoints and well defined operations on them, using proper HTTP verbs and IRIs of manipulated resources. Unfortunately, this is not the case for the Wikibase API, which primarily supports manual operations performed by people using the MediaWiki web application. The complexity and unfriendliness of the Wikibase JSON-based API is also discussed in a blog post about handling it using the Go programming language. Using the pipeline in Figure 4 and a considerable amount of googling and using Postman, we were actually able to log in to the API and create and modify items. In order to perform an edit operation, a user (or his library) has to:

Get a Login token (LT) and a cookie (C1) by accessing e.g. https://www.wikidata.org/w/api.php?action=query&meta=tokens&type=login&format=json
Using LT and C1, perform the login using username and password and get another cookie C2
Using C2, access e.g. https://www.wikidata.org/w/api.php?meta=tokens&format=json&action=query and get a CSRF token
Using C2 and the CSRF token, perform an edit operation.

Issue 2: CSRF token expiration

Besides the complexity, there is one feature of the API which we were not able to handle, and that is the CSRF token expiration. A CSRF token is used in web applications to protect against Cross-site request forgery when people are browsing a website in their web browser. It has no place in a machine-friendly Web API. Nevertheless, when discussed with the Wikibase API developers at the 2019 Wikimedia Hackathon, it was clear that this behaviour is not going to change and the best course of action is to use a library, which already handles all this. In LP-ETL, the pipeline execution is controlled by the data flow, which cannot handle situations where after some successful API calls a "token expired" error appears, requiring a token renewal action.

Issue 3: Long-running loading operations

Currently, LP-ETL can resume pipeline execution when the execution fails between the execution of two consecutive components. However, with loading data to Wikibase or Wikidata, it seems that the loading itself, which will be handled by a single component, can take even weeks. Currently, if the loading would fail in the middle, the whole loading operation would have to be run from scratch. One of the requirements on our solution, therefore, is to be able to resume execution even within a single component, when its mode of operation allows it.

Wikidata and bot accounts

A final, non-technical challenge of our approach is getting a Wikidata bot account. While on a private Wikibase instance, getting a bot account is not a problem, in Wikidata, there is an approval process for this, when a user needs higher limits for edit operations. The Wikidata contributors we asked confirmed that this is indeed necessary for Wikidata, as normally, the edit rates are low.

Requirements on LinkedPipes ETL enhancements

Based on our analysis, consultations with Wikidata contributors and our experience with publishing Linked Open Data, we have identified the following requirements on the solution.

Global requirements on improving LinkedPipes ETL

G1: The whole project needs to be dockerized for easier deployment. This was already requested by other users and this is a perfect opportunity to finally do that.
G2: An HTTP-based file browser needs to be implemented for debugging of pipelines. Currently, there is only and FTP-based one, and browsers are continuously limiting FTP support. In addition, the FTP file browsing required additional TCP ports to be opened. Again, this was in the backlog for a long time and now it is really necessary.
G3: Resumable loading operations within one component - generally, or at least in the Wikibase loading component, due to expected long-running loading operations.

Requirements on Wikibase Uploader component

Requirements on Wikibase Uploader component Since the Wikibase data model is quite extensive, the resulting component will most likely not be able to cover all possible variances in the input data. We will primarily focus on creating and updating statements along with their references as required by our proof-of-concept use cases, adding more specific functional requirements as we go. The initial set of requirements is this:

U1: The component will adhere to the MediaWiki API Etiquette, specifically, it will use its own User-Agent HTTP header and the maxlag parameter
U2: The component will accept the following configuration
- Wikibase API URL, e.g.
  - https://wikibase.opendata.cz/w/api.php
  - https://www.wikidata.org/w/api.php
- Username
- Password
- Wikibase ontology IRI base, e.g.
  - For Wikibase, it is @prefix wikibase: <http://wikiba.se/ontology-beta#> .
  - For Wikidata, it is @prefix wikibase: <http://wikiba.se/ontology#> .
- Wikibase instance IRI base, e.g.
  - https://wikibase.opendata.cz
  - http://www.wikidata.org
- Maximum edits per minute
U3: The component will accept the following input
- Data structured according to the Wikibase RDF dump format
- Items and statements to be created are flagged using a proprietary vocabulary
- Property IRIs will be based on the target Wikibase ontology URI
U4: The component will be able to create items and statements about them from the input in the target Wikibase
U5: The component will be able to create references for created statements
U6: The component output will contain mapping of the input items to be created to the QIDs of the actually created items.

Prerequisites

During our analysis, we have identified the following prerequisites, which need to be satisfied in order for our solution to be usable:

All property IDs to be used, including those used for references (in SPARQL) already exist.
Target Wikibase has a Blazegraph Query Service endpoint including the updater service running.
The pipeline authors are able to determine which items from the data source already exist in the target Wikibase instance (based on the content of the associated Blazegraph endpoint) using SPARQ, which includes familiarity with the RDF dump format.