Wikidata:Pywikibot - Python 3 Tutorial/Big Data

Big Data harvesting requires more efficient methods.

This chapter introduces the concept of gathering data from more than one Wikidata item. As you can probably guess, iterating over more than 10 million items one by one is extremely inefficient. We therefore need a way to pre-select a subset of all items.

Introduction

To follow along with the next few examples you should understand generators. In a for-loop, a generator behaves much like a list. Instead of iterating over the items stored in a list, the for-loop iterates over each item the generator yields, one at a time.
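As a quick refresher, here is a minimal sketch of a generator in plain Python. The function name count_up_to is made up for illustration and has nothing to do with Pywikibot; it only shows that values are produced one at a time, as the for-loop asks for them.

```python
def count_up_to(limit):
    """Yield the integers 1..limit one at a time."""
    current = 1
    while current <= limit:
        # Execution pauses here until the for-loop asks for the next value.
        yield current
        current += 1


numbers = count_up_to(3)   # nothing has been computed yet
collected = []
for number in numbers:     # each pass resumes the function above
    collected.append(number)

print(collected)  # prints [1, 2, 3]
```

Because the values are produced on demand, a generator can stand in for a sequence that is too large to hold in memory at once.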

The examples on this page can easily be combined with the examples from the previous chapters: query certain statements inside the for-loop and write functions that save the data to disk.

Selecting Items by Template Usage

One way to iterate over a subset of items is to choose them by the use of a template on Wikipedia. To do this, we write a function that returns a generator yielding each page for us to iterate over. The example looks at the usage of Template:Infobox meteorite (Q6037522) on en-wiki, but you can replace the string with another template name. It is difficult to separate the parts of the example, so read it, run it, and then we will discuss the new things:

import pywikibot

from pywikibot import pagegenerators as pg

def list_template_usage(site_obj, tmpl_name):
    """
    Takes Site object and template name and returns a generator.

    The function expects a Site object (pywikibot.Site()) and
    a template name (String). It creates a list of all
    pages using that template and returns them as a generator.
    The generator will load 50 pages at a time for iteration.
    """
    name = "{}:{}".format(site_obj.namespace(10), tmpl_name)
    tmpl_page = pywikibot.Page(site_obj, name)
    ref_gen = tmpl_page.getReferences(follow_redirects=False)
    filter_gen = pg.NamespaceFilterPageGenerator(ref_gen, namespaces=[0])
    generator = site_obj.preloadpages(filter_gen, pageprops=True)
    return generator

site = pywikibot.Site("en", 'wikipedia')
tmpl_gen = list_template_usage(site, "Infobox meteorite")

for page in tmpl_gen:
    item = pywikibot.ItemPage.fromPage(page)
    print(page.title(), item.getID())

The first line that is executed gets the Site-object of the English Wikipedia. The second already calls the function that returns the generator. The function takes two arguments and is therefore sufficiently flexible to handle any language of Wikipedia and different templates. Notice that we don't write "Template:Infobox meteorite". The namespace is added in the function itself.

Within the list_template_usage() function we first construct a string consisting of the namespace and the template name. The namespace is queried from the Site object (site_obj.namespace(10) returns "Template"). Next we get the Page object of the template page by passing the Site object and the full name we just constructed.

Once we have the template Page object, we get a generator of the pages that refer to it (getReferences() returns a PageGenerator object). We then pass this to the NamespaceFilterPageGenerator (namespace 0 has the name "", an empty string, and is the standard namespace in which Wikipedia entries reside) and finally to the preloadpages generator, which is returned by the function. These lines are more advanced; to find out more about them, read the source in pywikibot/site.py and pywikibot/pagegenerators.py.

Finally we use the tmpl_gen variable, which stores the generator, to start a for-loop. The for-loop gets 50 pages at a time, iterates over them, and then asks the generator for the next batch of pages until the generator yields no more pages. The print statement we put in the for-loop will output the following:

Retrieving 50 pages from wikipedia:en.
Wold Cottage (meteorite) Q4053207
Allan Hills 84001 Q47580
Campo del Cielo Q1031478
Sayh al Uhaymir 169 Q2228546
Sikhote-Alin meteorite Q652204
...
... total of 50 objects
...
Retrieving 50 pages from wikipedia:en.
Gao–Guenie meteorite Q176241
Pallasovka (meteorite) Q7127754

We can see that this is a really powerful and easy way to preselect items for querying Wikidata.
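The batching described above can be imitated in plain Python. This is only a sketch of the idea, not how preloadpages is implemented: in_batches is a hypothetical helper, the titles are taken from the output above, and the batch size is set to 2 to keep the run short.

```python
from itertools import islice


def in_batches(iterable, batch_size):
    """Yield items one by one, but fetch them from the source in batches."""
    iterator = iter(iterable)
    while True:
        # Pull up to batch_size items from the source in one go.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # source exhausted, so the generator ends
        print("Retrieving {} pages.".format(len(batch)))
        for entry in batch:
            yield entry


titles = ["Wold Cottage (meteorite)", "Allan Hills 84001", "Campo del Cielo"]
seen = []
for title in in_batches(titles, 2):
    seen.append(title)
    print(title)
```

The loop body never sees the batches themselves; it receives one title at a time, while the "Retrieving" message appears only when a new batch is fetched.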

Selecting Items by Wikidata Statement

Pywikibot also allows you to select items by statement. This can be done using a SPARQL query. As an example, we will look at all the items that have a pKa (P1117) value set.

First of all, we need to build the query and check if we get the correct results. I created this query:

#Items that have a pKa value set
SELECT ?item ?value
WHERE 
{
	?item wdt:P1117 ?value .
}

Attention: When you build your own query, note that Pywikibot reads the selected items from the variable ?item by default (this name is the default of the generator's item_name parameter). Likewise, ?itemLabel is the only variable allowed for selecting labels.

This currently yields around 200 results.

In the next step, we can copy the query into a file named pka-query.rq in our project directory (.rq is the file extension for SPARQL queries).
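If you prefer to create the file from Python instead of copying it by hand, a minimal sketch could look like this. The helper save_query and the constant PKA_QUERY are names chosen for this illustration; only the query text and the file name pka-query.rq come from this page.

```python
from pathlib import Path

# The query from above, stored as a plain string.
PKA_QUERY = """\
#Items that have a pKa value set
SELECT ?item ?value
WHERE
{
    ?item wdt:P1117 ?value .
}
"""


def save_query(path, query):
    """Write the SPARQL query to the given file and return its Path."""
    target = Path(path)
    target.write_text(query, encoding="utf-8")
    return target


# Example call (writes into the current working directory):
# save_query("pka-query.rq", PKA_QUERY)
```

Keeping the query in its own file means it can be tested in the query service and edited without touching the bot code.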

Loading the query in the script is straightforward, and the following snippet shows how to call the generator and iterate over the items:

#!/usr/bin/python3

import pywikibot
from pywikibot import pagegenerators as pg

with open('pka-query.rq', 'r') as query_file:
    QUERY = query_file.read()

wikidata_site = pywikibot.Site("wikidata", "wikidata")
generator = pg.WikidataSPARQLPageGenerator(QUERY, site=wikidata_site)

for item in generator:
    print(item)

This prints every item that the query selected as ?item, one per line.

Conclusion

The goal of this chapter was to teach you how to iterate over Wikidata in a more intelligent way than going from Universe (Q1) all the way up to the most recent item. Try to keep the selection logic in a separate function, so that you can adapt your bot to a different use case without changing too many lines of code.
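One possible way to structure this is sketched below in plain Python. The processing function accepts any iterable of items, so the selecting function can be swapped freely. All names here (select_demo_items, process_items) are hypothetical, and the selector returns hard-coded IDs from the output shown earlier instead of calling Pywikibot.

```python
def select_demo_items():
    """Stand-in selector; a real bot would return a Pywikibot generator here."""
    yield from ["Q4053207", "Q47580", "Q1031478"]


def process_items(items, handle):
    """Apply `handle` to every selected item and return how many were seen."""
    count = 0
    for item in items:
        handle(item)
        count += 1
    return count


handled = []
total = process_items(select_demo_items(), handled.append)
print(total, handled)
```

To switch from template-based selection to a SPARQL-based one, only the first argument passed to process_items would need to change.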