Wikidata:Contact the development team/Query Service and search/Archive/2020/09

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

Please add wikibooks to SERVICE wikibase:mwapi on query server (August 21)

Tracked in Phabricator
Task T261125

Currently this throws an error:

java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: java.lang.IllegalArgumentException: Host en.wikibooks.org is not allowed

Sample: [1]. Wikibooks in any language should be available. --- Jura 20:31, 21 August 2020 (UTC)

Thanks for noticing, I created phab:T261125 to address the problems. DCausse (WMF) (talk) 13:06, 24 August 2020 (UTC)

@DCausse (WMF): thanks for fixing it. I corrected the sample above: works now. --- Jura 15:59, 16 September 2020 (UTC)

Query through Python/Requests 100x slower than query.wikidata.org

Running basic SPARQL requests through my Python application (example here) takes ~75s on average, while the same requests take ~700ms on query.wikidata.org.

I think my IP address may have been throttled, as I only recently became aware of the Wikidata User Agent policy (see my question on August 18). I've since updated my application to follow the policy (here).

When I started building this application my queries were running at a similar speed to query.wikidata.org. Is there anything that can be done to speed up these queries again? I'm already running all text search queries through an Elasticsearch index I've created from the JSON dump, but need a way to get entity claims and shortest paths between entities in bulk.

Thanks Kdutia (talk) 14:44, 11 September 2020 (UTC)
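For readers following the User-Agent point above, here is a minimal sketch of attaching a descriptive User-Agent header to a SPARQL request. It uses Python's standard library rather than Requests so that it has no third-party dependency. The tool name, version and contact address are hypothetical placeholders, and the exact header format should be checked against the policy itself. The request is only built here, not sent.

```python
import urllib.parse
import urllib.request

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_user_agent(tool, version, contact):
    """Return a descriptive User-Agent string: tool/version (contact)."""
    return f"{tool}/{version} ({contact})"


def build_sparql_request(query, user_agent):
    """Build (but do not send) a GET request for a SPARQL query."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return urllib.request.Request(
        f"{WDQS_ENDPOINT}?{params}",
        headers={
            "User-Agent": user_agent,
            "Accept": "application/sparql-results+json",
        },
    )


# Hypothetical values for illustration only.
example_agent = build_user_agent("ExampleGlamTool", "0.1", "mailto:someone@example.org")
example_request = build_sparql_request("SELECT ?s WHERE { ?s ?p ?o } LIMIT 1", example_agent)
```

Passing `example_request` to `urllib.request.urlopen` would send it. With Requests, the same header dictionary can be passed through the `headers` argument.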

@Kdutia: Have you considered running at least some of what you need on toolforge? It may have some other advantages if you are providing a public service of some sort that's based on Wikidata. ArthurPSmith (talk) 17:21, 11 September 2020 (UTC)

@ArthurPSmith: I hadn't come across Toolforge, but I don't think it's suitable for our current needs. Although the application I'm building is for public use, it's designed so that developers (GLAM institutions) install and run their own instance rather than accessing a centralised application instance. We're also using big GPU VMs to train machine learning models. I've looked through the documentation but I'm not sure what the advantages would be. Could you tell me more about the advantages of developing on Toolforge? – The preceding unsigned comment was added by Kdutia (talk • contribs) at 10:23, 14 September 2020 (UTC).

I don't know if there's an advantage for SPARQL use, but Toolforge has replica databases that can be queried directly, which is useful for many applications. But it sounds like you are running into something else here? ArthurPSmith (talk) 17:39, 14 September 2020 (UTC)

@ArthurPSmith: Interesting, thanks. I'll look into the replica databases. I'm just trying to find labels of claims and shortest paths between entities efficiently at the scale of thousands. --Kdutia (talk) 15:17, 15 September 2020 (UTC)

@Kdutia: The same servers handle traffic from the query service UI and from the API, so there is no reason a request accepted via the API should be slower than one made via the UI (apart from some variance due to current load and caching status). If your client is being throttled, the query should receive a 429 (or a 403 if the Retry-After is not respected) rather than wait. Do you know how long your client waits on its own (throttling itself) versus waiting for the HTTP request to complete? Thanks for using your own Elasticsearch index, by the way. Did you notice that the Elasticsearch index contains some claims in the statement_keywords array field that might be handy for simple queries? Note that there is also the option to host your own Blazegraph instance if you can afford it. DCausse (WMF) (talk) 10:02, 15 September 2020 (UTC)
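One way to answer the question above about client-side waiting versus HTTP time is to account for the two separately. The sketch below is an illustration only and is not taken from the application under discussion. The class name `WaitBreakdown` and its methods are hypothetical. The `send` argument is any zero-argument callable that performs the HTTP request.

```python
import time


class WaitBreakdown:
    """Accumulate time spent in HTTP calls vs. the client's own back-off sleeps."""

    def __init__(self, clock=time.perf_counter, sleep=time.sleep):
        # clock and sleep are injectable so the accounting can be checked offline.
        self._clock = clock
        self._sleep = sleep
        self.http_seconds = 0.0
        self.sleep_seconds = 0.0

    def request(self, send):
        """Call send() and add its wall-clock duration to http_seconds."""
        start = self._clock()
        try:
            return send()
        finally:
            self.http_seconds += self._clock() - start

    def backoff(self, seconds):
        """Sleep for the given time and add it to sleep_seconds."""
        start = self._clock()
        self._sleep(seconds)
        self.sleep_seconds += self._clock() - start
```

If `sleep_seconds` dominates after a batch of queries, the slowdown comes from the client's own throttling logic. If `http_seconds` dominates, the time is spent waiting on the server or the network.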

@DCausse (WMF): I've checked and I'm not receiving 429 or 403 errors. My code respects the Retry-After header for a 429 and waits 10 seconds if it doesn't receive a Retry-After header. The two main queries I'm trying to run are shortest path and getting specific claims and their labels back for a list of entities. I just tried the shortest path query on the query service UI and in a Jupyter notebook, and it's 100x slower using my Python method. At the moment I'm using my own tool to import labels, aliases and a subset of claims (currently just P31 and P279) into an Elasticsearch index. Where can I find more information on statement_keywords? I've seen the guide for setting up our own Wikidata instance, but I'm trying to avoid this for as long as possible, as the idea is that the tool we're building can easily be set up and run by GLAM institutions, and setting up a Blazegraph instance would probably be the most costly part. --Kdutia (talk) 15:17, 15 September 2020 (UTC)
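The retry behaviour described above (honour Retry-After on a 429, otherwise wait 10 seconds) could be sketched as follows. This is not the poster's actual code. The function name `retry_delay` is hypothetical, and as a simplification a Retry-After value given as an HTTP date rather than a number of seconds falls back to the default wait.

```python
DEFAULT_WAIT_SECONDS = 10.0  # fallback wait mentioned in the comment above


def retry_delay(status_code, headers, default=DEFAULT_WAIT_SECONDS):
    """Return seconds to wait before retrying, or None if no retry is needed.

    headers is any mapping of header names to values. Names are matched
    case-insensitively.
    """
    if status_code != 429:
        return None
    lowered = {name.lower(): value for name, value in headers.items()}
    value = lowered.get("retry-after")
    if value is None:
        return default
    try:
        # Retry-After is usually a number of seconds.
        return max(0.0, float(value))
    except ValueError:
        # Simplification: an HTTP-date value is not parsed here.
        return default
```

A calling loop would sleep for the returned number of seconds before re-sending the query, and proceed normally when the function returns `None`.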

@DCausse (WMF): I've realised one more thing that seems to indicate it may be a server-side issue. In the morning my shortest path queries run in under a second, but by early afternoon UK time (around 1-2 pm) they slow down to about 70-100s/iteration.