Wikidata:Wikidata curricula/Activities/Pywikibot/Missing label in target language
A very simple Pywikibot script to replicate missing language labels.
Example: Select all Belgian persons (politicians, engineers, business people, ...) having missing language labels.
This was the first Python script I ever wrote.
I first implemented it as a Wikidata Script with QuickStatements, but a fully automated Python script is much more effective (no manual data manipulation with Excel required).
Usage
You can run this script from the shell, or from PAWS:
./missing_person_label.py [input language] [output languages]...
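As a rough illustration of how these arguments are interpreted, the sketch below splits an argument list into one source language and a list of target languages, dropping any target that equals the source. The helper name `parse_languages` is hypothetical and does not appear in the original script.

```python
def parse_languages(argv):
    """Split argv into (source language, list of target languages).

    Hypothetical helper; returns None when fewer than two languages are given.
    """
    args = argv[1:]                # skip the name of the executable
    if len(args) < 2:              # need one source and at least one target
        return None
    inlang, outlangs = args[0], args[1:]
    # A target equal to the source language has nothing to copy
    targets = [lang for lang in outlangs if lang != inlang]
    return inlang, targets


print(parse_languages(["missing_person_label.py", "nl", "en", "fr", "nl"]))
# prints ('nl', ['en', 'fr'])
```

Here `nl` is the input language and `en` and `fr` are the output languages; the repeated `nl` is skipped.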
Simplified script
The simplified script below gives you an idea of what it does.
Note: You can consult the complete script code history, including technical details, and more documentation.
Tips:
- You can select the country, the source, and the target languages
- I am running this on an always-on Raspberry Pi (actually a Piwikibot 🙂)
- It has low power consumption, and it is serving other functions in the home anyway
- My laptop does not need to stay online, which saves electricity and lets me move to another network while the script keeps running on the Pi
#!/usr/bin/python3
import sys
import time
import pywikibot
from datetime import datetime
from pywikibot import pagegenerators as pg

def wd_proc_all_items():
    # Uses the module-level variables inlang and outlang set in the main part below
    QUERY = """# Search for Belgian citizens with a missing label in the target language
SELECT DISTINCT ?item WHERE {
  VALUES ?instance { wd:Q5 }
  VALUES ?country { wd:Q31 }
  ?item wdt:P31 ?instance;
        wdt:P27 ?country;
        rdfs:label ?itemLabel.
  FILTER((LANG(?itemLabel)) = '""" + inlang + """')
  MINUS {
    ?item rdfs:label ?label.
    FILTER((LANG(?label)) = '""" + outlang + """')
  }
}
"""
    print(QUERY)
    wikidata_site = pywikibot.Site("wikidata", "wikidata")
    generator = pg.WikidataSPARQLPageGenerator(QUERY, site=wikidata_site)
    i = 0
    errsleep = 0
    print('Getting data')
    now = datetime.now()
    for item in generator:
        i += 1
        status = 'OK'
        label = ''
        try:
            item.get()
            if inlang in item.labels:
                label = item.labels[inlang]
            if label == '':                 # Label not available; skip update
                status = 'Ignore'
            elif outlang in item.labels:    # Target label already updated; skip duplicate update
                status = 'Skip'
            else:                           # Update the target item label
                item.editLabels({outlang: label}, summary="Pwb copy " + inlang + " label")
            errsleep = 0
        except KeyboardInterrupt:
            sys.exit(1)
        except Exception:                   # Any other error: log it and continue with the next item
            status = 'Error'
            totsecs = int((datetime.now() - now).total_seconds())   # Calculate technical error penalty
            if totsecs >= 30:               # Technical error
                errsleep += totsecs * 5
        if errsleep > 0:                    # Allow the servers to catch up
            print('%d seconds maxlag wait' % errsleep)
            time.sleep(errsleep)
        prevnow = now
        now = datetime.now()
        isotime = now.strftime("%Y-%m-%d %H:%M:%S")
        totsecs = (now - prevnow).total_seconds()
        print('%d\t%s\t%s\t%f\t%s\t%s' % (i, isotime, status, totsecs, item.getID(), label))

param = sys.argv                    # Get the command parameters
if len(param) <= 2:                 # Welcome the user
    print('in out...')
else:
    param.pop(0)                    # Skip the name of the executable
    inlang = param.pop(0)           # P1 = Source language (mandatory parameter)
    for outlang in param:           # Loop for all target languages (mandatory parameter)
        if inlang != outlang:       # Skip input language
            wd_proc_all_items()     # Execute all items for one language
Prerequisites
You need to install and configure Pywikibot on a (virtual, private) Linux system, or use PAWS on a shared server.
Known problems
It is important to have a proper error handler so that the script can recover from errors in single transactions. Without a proper error handler, the script would fail (repeatedly) with a fatal error on the (same) first failing transaction and would not continue with the remaining transactions.
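The recovery pattern can be sketched independently of Pywikibot. In this minimal sketch, `process_all`, `action`, `sleep`, and `clock` are hypothetical names introduced for illustration; the 30-second threshold and the factor of 5 are taken from the simplified script, and the rest is an assumption about how one might structure it.

```python
import time


def process_all(items, action, sleep=time.sleep, clock=time.monotonic):
    """Apply action to every item; one failing item must not stop the rest.

    Returns a list of (item, status) tuples. Hypothetical helper.
    """
    results = []
    errsleep = 0
    for item in items:
        start = clock()
        try:
            action(item)
            status = "OK"
            errsleep = 0                    # a success resets the penalty
        except Exception:                   # KeyboardInterrupt is not caught here
            status = "Error"
            elapsed = int(clock() - start)
            if elapsed >= 30:               # slow failure: suspect server trouble
                errsleep += elapsed * 5
        if errsleep > 0:
            sleep(errsleep)                 # give the servers time to catch up
        results.append((item, status))
    return results
```

Because the `try` block sits inside the loop, a failure is recorded for that item only and the loop moves on to the next one. Passing `sleep` and `clock` as parameters is only there to make the behaviour easy to try out without actually waiting.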
Execute by item
- Updates should be executed by item, instead of by language (avoid multiple watch notifications)
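One way to approach a by-item update is to work out, for each item, all target languages that still lack a label and write them together. The sketch below shows only that calculation on a plain dictionary; `missing_labels` is a hypothetical helper and the example label is made up. The resulting dictionary has the same shape as the one the simplified script passes to `item.editLabels`, so it could presumably be submitted in a single call per item.

```python
def missing_labels(labels, inlang, outlangs):
    """Return {language: label} for every target language lacking a label.

    labels maps language codes to label text, like item.labels after
    item.get() in the simplified script. Hypothetical helper.
    """
    source = labels.get(inlang, "")
    if source == "":                # no source label: nothing to copy
        return {}
    return {lang: source for lang in outlangs
            if lang != inlang and lang not in labels}


example = {"nl": "Jan Janssens", "fr": "Jan Janssens"}   # made-up labels
print(missing_labels(example, "nl", ["en", "fr", "de"]))
# prints {'en': 'Jan Janssens', 'de': 'Jan Janssens'}
```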
User errors
- WARNING: Http response status 400
- Syntax error in the SPARQL code
- Paste your query in Wikidata Query to verify and correct
- ERROR: An error occurred for uri ... WARNING: Waiting 240 seconds before retrying.
- Too many items in query: add additional filters to reduce the number of items
- You can relax the filters after data gets processed
- WARNING: Http response status 429
- LIMIT too high or missing
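Several of these user errors come from the query text itself. The sketch below builds the same query as the simplified script, but checks the parameters before inserting them and optionally appends a LIMIT clause. The function name `build_query`, the `country` and `limit` parameters, and the two regular expressions are assumptions made for illustration; in particular, the language-code pattern is only a rough approximation.

```python
import re

# Assumed shapes, for illustration only
LANG_RE = re.compile(r"^[a-z]{2,3}(-[a-z0-9]+)*$")
QID_RE = re.compile(r"^Q[1-9][0-9]*$")


def build_query(inlang, outlang, country="Q31", limit=None):
    """Return the SPARQL text for items with an inlang label but no outlang label."""
    for lang in (inlang, outlang):
        if not LANG_RE.match(lang):
            raise ValueError("unexpected language code: %r" % lang)
    if not QID_RE.match(country):
        raise ValueError("unexpected item ID: %r" % country)
    query = """SELECT DISTINCT ?item WHERE {
  VALUES ?instance { wd:Q5 }
  VALUES ?country { wd:%s }
  ?item wdt:P31 ?instance;
        wdt:P27 ?country;
        rdfs:label ?itemLabel.
  FILTER((LANG(?itemLabel)) = '%s')
  MINUS {
    ?item rdfs:label ?label.
    FILTER((LANG(?label)) = '%s')
  }
}""" % (country, inlang, outlang)
    if limit is not None:
        query += "\nLIMIT %d" % int(limit)   # keep the result set small
    return query


print(build_query("nl", "en", limit=100))
```

The printed text can be pasted into Wikidata Query to verify it before running the script. Rejecting a malformed language code early avoids sending a query that would only come back with a status 400.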
Data errors
- WARNING: wikibase-form datatype is not supported yet.
- WARNING: wikibase-lexeme datatype is not supported yet.
- WARNING: API error modification-failed: Item Q682310 already has label "Gerrit Schimmelpenninck" associated with language code en, using the same description text.
- The target label remained empty
- Another item had the same label and description
- Edit the 2 items to have a unique description (e.g. adding the birth/death date)
- You should add different from (P1889) for both items
- Possibly you might need to merge 2 identical items
- WARNING: API error editconflict: Edit conflict. Could not patch the current revision.
Login failure
- WARNING: API error badtoken: Invalid CSRF token: general problem with authentication (temporary)
Server errors
- MaxlagTimeoutError: retry later (replication server busy)
- OtherPageSaveError: ignore (the update was still made; verify the item update history)
- ReadTimeoutError: retry later (HTTPS network error)
- WARNING: API error failed-save: The save has failed. (general error)
- Sleeping for 9.0 seconds, 2020-06-11 00:17:54
- The script runs rather slowly so as not to overload the servers (about 10 transactions per minute; use put_throttle = 6)
- Set noisysleep = 60.0 to avoid too many "Sleeping" messages
- When a transaction error occurs, the application sleeps for some minutes (maxlag wait suspected)
- Create and configure a bot account (higher transaction speed allowed)
- You can increase the execution speed by assigning a lower value to put_throttle
- Maximum retries attempted due to maxlag without success.
- Maximum retries attempted without success.
- WARNING: API error readonly: The database has been automatically locked while the replica database servers catch up to the master
- WARNING: API error internal_api_error_JobQueueError
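The two throttle settings named in the list above are Pywikibot configuration variables. They are typically set in the user configuration file (user-config.py), which is itself a Python file. A minimal fragment using only the values given above might look like this:

```python
# Fragment for the Pywikibot user configuration file (values from the notes above)
put_throttle = 6      # seconds to wait between two save operations (about 10 per minute)
noisysleep = 60.0     # only report "Sleeping" delays longer than this many seconds
```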
Network errors
- requests.exceptions.ConnectionError: HTTPSConnectionPool(host='query.wikidata.org', port=443): Read timed out.
- requests.exceptions.ConnectionError: HTTPSConnectionPool (network error)
- Remote end closed connection without response
PAWS
- "Username unknown" problem with OAuth and sitelinks to special namespaces
Workaround: create the missing username with e.g. wikisource:Special:UserLogin
Notes
- You should manually amend failed transactions
- Transactions that were skipped because of a transient error can be retried later.
- You should wait until the transactions are replicated to the SPARQL reporting instance before re-executing the script to avoid duplicate transactions
External link
- https://github.com/geertivp/Pywikibot/blob/main/copy_label.py (replaces the above script)