User:Pintoch/Issues with pywikibot

This page lists some issues with pywikibot. I think these issues are rather serious, and it is not clear to me whether it is worth investing time to fix them or better to abandon ship and work on alternative libraries. I believe this is an important discussion to have, as pywikibot is often showcased by the Wikidata community as the standard, official library to use when writing bots in Python.

Issue 1: pywikibot is not a Python library

Pywikibot has a long history - it was originally designed to provide automated tools for Wikipedia. The home page for pywikibot shows that pywikibot is primarily advertised as a collection of reusable scripts that can be used from the command line. These scripts rely on a bunch of common Python files, which are supposed to form an independent library. But in many ways, it seems that this "library" is only a byproduct of pywikibot: the library is not primarily designed to be reused. Here is why:

  • There is basically no release plan - packages are added on PyPI only when someone finds the time to push a snapshot of the master branch there.[1] Earlier this year, it was impossible to log in (and hence to edit) with the packaged version on PyPI for months.[2] This really shows that pywikibot is not meant to be used as a dependency in a Python project - packaging is not a priority at all and users are just told to clone the repository with git (this installation method is even mentioned on PyPI before pip install pywikibot itself).
  • Pywikibot is not structured as a library. As soon as you import pywikibot in a Python file, you are required to have a user-config.py file in the current working directory, where your credentials are stored as Python objects. No decent library would ever require that by default - just think about the uprising that would happen if the maintainers of the requests library decided that requests 4.0 would load user-agent.py in the current working directory to set the User-Agent header of all requests it makes. Of course it is possible to bypass this behavior (by setting an environment variable) - but it is dirty in any case. I understand the concern of separating the user data from the bot code but this is a standard issue which can have much cleaner solutions. User credentials should be stored in Site objects and hence only loaded when a Site is instantiated, not when importing the library. Incidentally, this would make it much easier to add support for login methods where the credentials cannot be hard-coded in a file, such as OAuth.
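The design suggested in the last point can be sketched as follows. This is only an illustration of the idea, not pywikibot's API: the Credentials and Site classes, their fields and the bot name are all hypothetical. The point is that importing the module has no side effects, and credentials are supplied only when a site object is created.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Credentials:
    """Hypothetical credential holder; it could equally wrap an OAuth token."""
    username: str
    secret: str


class Site:
    """Hypothetical site object: configuration arrives at instantiation."""

    def __init__(self, api_url: str, credentials: Optional[Credentials] = None):
        self.api_url = api_url
        self._credentials = credentials

    @property
    def is_authenticated(self) -> bool:
        # A site created without credentials is still usable anonymously.
        return self._credentials is not None


# Nothing above reads a file from the current working directory:
# the caller decides where credentials come from and passes them in.
anonymous = Site("https://www.wikidata.org/w/api.php")
bot = Site(
    "https://www.wikidata.org/w/api.php",
    Credentials("ExampleBot", "not-a-real-secret"),
)
```

With this layout, the off-the-shelf scripts could still read a configuration file and build the Site themselves, without forcing that behavior on every project that imports the library.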

So, ideally pywikibot should be split into two distinct projects:

  • a library that can be reused in Python projects (a generic wrapper of the MediaWiki / Wikibase API), with a sane architecture
  • a collection of off-the-shelf scripts relying on this library to perform some common tasks (these tasks can be specific to particular Wikimedia projects) - this can be more hacky as the scripts are not intended to be building blocks of larger projects

But again, it is not even clear that it is worth reusing pywikibot's core to make a library with a sane architecture.

Issue 2: pywikibot's representation of the Wikibase data model is overly simplified

When writing a wrapper for an API, it is tempting to try to simplify some aspects of the API or the data model in the underlying service. The idea is that some level of abstraction is necessary, otherwise there is no real benefit over making HTTP requests directly. The issue with pywikibot is that this abstraction is brittle, because the Wikibase data model is overly simplified. This is hard to fix without restructuring the library and breaking compatibility in a major way.

In the Wikibase data model, a snak is essentially a pair of a property and a value for that property. A statement is a claim that you see on an item. Each statement contains a snak, called the main snak, which represents the property of the statement and its main target value. But a statement contains more than that: it also holds snaks for its qualifiers, as well as references, which themselves contain lists of snaks.
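The structure just described can be written down as three small types. This is a simplified sketch with hypothetical class and field names, not the classes of any existing library; ranks, statement identifiers and the special "unknown value" and "no value" snaks are left out, and the identifiers in the example are only placeholders.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Snak:
    """A property together with a value for that property."""
    property_id: str
    value: Any


@dataclass
class Reference:
    """A reference is itself a list of snaks."""
    snaks: List[Snak]


@dataclass
class Statement:
    """A main snak plus its qualifiers and references."""
    main_snak: Snak
    qualifiers: List[Snak] = field(default_factory=list)
    references: List[Reference] = field(default_factory=list)


# Placeholder identifiers, for illustration only.
statement = Statement(
    main_snak=Snak("P31", "Q5"),
    qualifiers=[Snak("P580", "2001-01-01")],
    references=[Reference([Snak("P248", "Q36578")])],
)
```

A snak appears in three different positions here, which is why it is a distinct concept from the statement that contains it.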

Pywikibot conflates two distinct object types in the data model: statements and snaks. Both concepts are represented by the Claim class.[4] It is a simplification that is quite tempting because most end-users do not want to know what a snak is. While this could be desirable for a high-level interface, the issue is that even the internals of the library conflate the two. This makes it impossible to perform some actions supported by the Wikibase API, such as editing multiple statements at once with wbeditentity. Let us analyze why.
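For context, here is roughly what a wbeditentity request touching several statements at once looks like. The sketch only builds the request parameters and sends nothing; the helper names are hypothetical, the token is a placeholder, and the statement JSON is the standard Wikibase serialization reduced to item-valued main snaks.

```python
import json


def statement_json(property_id, numeric_item_id, statement_id=None):
    """Build the JSON for a statement whose main snak points to an item."""
    statement = {
        "type": "statement",
        "rank": "normal",
        "mainsnak": {
            "snaktype": "value",
            "property": property_id,
            "datavalue": {
                "type": "wikibase-entityid",
                "value": {"entity-type": "item", "numeric-id": numeric_item_id},
            },
        },
    }
    if statement_id is not None:
        # An existing statement is addressed by its identifier.
        statement["id"] = statement_id
    return statement


def wbeditentity_params(entity_id, statements, csrf_token):
    """Assemble the POST parameters for a single wbeditentity call."""
    return {
        "action": "wbeditentity",
        "id": entity_id,
        "data": json.dumps({"claims": statements}),
        "token": csrf_token,
        "format": "json",
    }


# Two statements sent in one request; identifiers are placeholders.
params = wbeditentity_params(
    "Q4115189",
    [statement_json("P31", 5), statement_json("P279", 35120)],
    csrf_token="PLACEHOLDER",
)
```

Everything in the data parameter is statement JSON, so a library can only build such a request reliably if it can serialize statements faithfully.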

As one can expect, Claim objects can be created from the JSON representation of a statement, and any Claim object can also be serialized back to JSON. One would naturally expect that calling toJSON on a Claim object created from a JSON representation would give us the original JSON (up to key ordering as always with JSON). But that is not the case: for instance, reference hashes will be erased. Of course this breaks a number of things, most importantly the computation of incremental changes between two versions of an item (for use with wbeditentity).
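The round-trip problem can be illustrated with a toy model. The LossyStatement class below is a hypothetical stand-in written for this page, not pywikibot's code, and the statement JSON is abbreviated; it only shows how dropping reference hashes on parsing makes an untouched statement look modified.

```python
import copy


class LossyStatement:
    """Toy stand-in for a class that does not keep reference hashes."""

    def __init__(self, statement_id, main_snak, reference_snaks):
        self.statement_id = statement_id
        self.main_snak = main_snak
        # Only the snaks of each reference are kept; the hash is dropped.
        self.reference_snaks = reference_snaks

    @classmethod
    def from_json(cls, data):
        return cls(
            data["id"],
            copy.deepcopy(data["mainsnak"]),
            [copy.deepcopy(ref["snaks"]) for ref in data.get("references", [])],
        )

    def to_json(self):
        return {
            "id": self.statement_id,
            "mainsnak": copy.deepcopy(self.main_snak),
            "references": [
                {"snaks": copy.deepcopy(snaks)} for snaks in self.reference_snaks
            ],
        }


# Abbreviated statement JSON with placeholder identifier and hash.
original = {
    "id": "Q4115189$EXAMPLE-GUID",
    "mainsnak": {"snaktype": "value", "property": "P31"},
    "references": [
        {
            "hash": "0123abcd",
            "snaks": {"P248": [{"snaktype": "value", "property": "P248"}]},
        }
    ],
}

round_tripped = LossyStatement.from_json(original).to_json()
# A naive comparison now reports a change although nothing was edited.
looks_modified = round_tripped != original
```

Any diff computed between the stored JSON and the re-serialized object would therefore include every statement that has a reference, whether or not the bot changed it.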

Fixing this properly requires a major rewrite - is it worth it?

Alternatives

In Python:

In Java:

Feel free to add links to other wrappers that can be used to write bots.

Notes

  1. https://phabricator.wikimedia.org/T152907
  2. https://phabricator.wikimedia.org/T142155
  3. https://doc.wikimedia.org/pywikibot/
  4. The Claim class actually subclasses the Property class, which represents Wikibase properties. This is another aberration, but it does not seem to cause any actual trouble beyond bleeding the eyes of the foolish passers-by who venture into pywikibot's source code.