Wikidata:Contact the development team/Query Service and search/Archive/2020/03

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

WDQS lag

Hi. Could you tell us about, or point us to, any discussions on fixing the current lag situation on a number of WDQS servers - see https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&from=now-2d&to=now&fullscreen&panelId=8 ... I don't see an obvious ticket on Phab for this. Thanks. --Tagishsimon (talk) 13:51, 3 February 2020 (UTC)

fwiw, it doesn't look as if maxlag is trying to deal with the situation - https://grafana.wikimedia.org/d/000000601/wikidata-addshore-monitoring?orgId=1&from=now-2d&to=now&fullscreen&panelId=2 --Tagishsimon (talk) 13:53, 3 February 2020 (UTC)
See also the discussion on the Administrators' noticeboard - one bot has been blocked, and there seems to be a slow recovery going on, but what is really strange is that one of the servers (wdqs1004) is mostly fine, with lags of at most a few minutes, while the others in that group (wdqs1005, 1006, 1007) have multi-hour lags. This seems to have been happening since Saturday, Feb 1. Can any developers shed light on what might be going on? ArthurPSmith (talk) 13:49, 4 February 2020 (UTC)
Some time yesterday, updates seem to have just stopped, and the lag increased linearly.
Even if the bot wasn't working in an optimal way (it was working on large items), I doubt it was the only reason. --- Jura 13:53, 4 February 2020 (UTC)
I am pretty sure that they don't *really* understand what's going on either. For months now, WDQS has not been able to cope with the load it is subjected to, and I am sure that they would have already fixed it if they only knew how to… We are meanwhile down to ~600k edits a day, and WDQS is still continuously overloaded.
Overall the situation is disappointing, and editing Wikidata really feels like a waste of time these days. Many of our workflows rely heavily on the Query Service and on automated editing; query results outdated by hours and batches that take an eternity are frustrating. Maybe it is time to set up a serious plan for how to fix this problem … —MisterSynergy (talk) 14:26, 4 February 2020 (UTC)

I don't know which is more frustrating: the lag, or the absence of any info / discussion from devs or community liaison. This is really Service Management 101.--Tagishsimon (talk) 15:15, 4 February 2020 (UTC)

So a check of "recent changes" just now shows a flurry of bot activity as soon as the wdqs1004 lag went below 5 minutes (at around 16:15 GMT); then after a few minutes it rose above that and the bots mostly went away. It is good that the bots are now mostly watching this parameter. But the lag is still many hours for 3 of the servers. It used to be that all four of the wdqs100* servers behaved pretty much the same way, but since Saturday wdqs1004 has somehow been much better at keeping up with edits. Can any developer explain what's up here? @Lydia Pintscher (WMDE), Lucas Werkmeister (WMDE): who is supporting WDQS now? ArthurPSmith (talk) 16:26, 4 February 2020 (UTC)

Phabricator link for those interested: https://phabricator.wikimedia.org/T243701 Strainu (talk) 17:17, 4 February 2020 (UTC)

  • If it's accepted that not all servers have the same lag, maybe active contributors (and editing bots) should be enabled to connect to a server without any lag. --- Jura 17:25, 4 February 2020 (UTC)
Hello all,
First of all, let's try to keep the discussion calm and productive here. @Tagishsimon:, as was already mentioned to you on Phabricator, no passive-aggressive ranting will help any code to work better or people to answer faster. We should start by acknowledging that we're all in the same boat here, trying to make things work as well as they can. Direct or indirect attacks on other people are not acceptable and are not good soil for collaborative problem-solving.
The issue seems to have been happening for a few days; the first message of this discussion was posted yesterday. Considering that employees have other things on their plate, and that no product or service is broken (the lag on WDQS is unfortunate, but it doesn't prevent Wikidata from working), this is still a decent response delay. Again, let's assume that people do their best to react to issues as soon as possible.
To answer the question from @ArthurPSmith:: the WDQS servers are taken care of by the Search Platform team at WMF. As you can read in this email from Guillaume Lederrey and this one, they are fully aware of the issue and they are working on a long-term solution. The primary goal of the Query Service is not necessarily to provide real-time information, and although we understand that this is how the community now expects it to work, in its current state it cannot function this way anymore. We are working on understanding the issues and the needs of the community, in order to provide an adapted solution that will work in the long term.
If you are willing to let us know in more detail about your current workflows, how you are using the WDQS in your daily Wikidata editing, and why it is important for you to have real-time data, feel free to use this page; we will be very happy to understand your needs better. In the meantime, I'll kindly ask you to be patient and respectful of other people's work.
Cheers, Lea Lacroix (WMDE) (talk) 17:26, 4 February 2020 (UTC)
I think there is a misunderstanding: following a change made recently, the lag does prevent bots from editing and ultimately Wikidata from working. --- Jura 17:44, 4 February 2020 (UTC)
@Lea Lacroix (WMDE): Hi Lea, thanks for responding here. However, the emails you reference are from November; the belief at the time was that adding the WDQS lag to the Wikidata "maxlag" parameter would ensure that bots behave reasonably and the servers can catch up. See this phab ticket, which was implemented after those emails. That solution worked for about 2 months; however, something broke on or around February 1, 2020, with the result that bots observing the maxlag constraint cannot edit for 80-90% of the time, and, as has been complained about here, the lag on some servers has grown to many hours, which causes other trouble. Wikidata will be close to unusable if this persists. ArthurPSmith (talk) 18:36, 4 February 2020 (UTC)
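For reference, the "maxlag" mechanism works like this: a bot sends a maxlag=N parameter with each API request, and the API rejects the request with a "maxlag" error (plus a Retry-After header) whenever the reported lag exceeds N seconds. A minimal sketch of a well-behaved edit loop, in Python - the endpoint and error code are standard MediaWiki API behaviour, while the wait logic is purely illustrative:

import time
import requests

API = "https://www.wikidata.org/w/api.php"

def request_with_maxlag(params, maxlag=5):
    # Attach maxlag so the API refuses the request while servers are lagged.
    params = dict(params, format="json", maxlag=maxlag)
    while True:
        response = requests.post(API, data=params)
        data = response.json()
        if data.get("error", {}).get("code") != "maxlag":
            return data  # success, or an unrelated error for the caller
        # The API sends Retry-After (in seconds) along with a maxlag error.
        time.sleep(int(response.headers.get("Retry-After", 5)))

With the WDQS lag folded into maxlag, a loop like this simply stops editing while the query service is behind - which is why bots respecting the constraint were idle 80-90% of the time during the incident described above.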
@Lea Lacroix (WMDE): It is not helpful to characterise my postings in this thread as attacks. One politely asks for information. The other expresses my frustration at the absence of any WMF interest in commenting. You merely pile disappointment on disappointment with what amounts to a dishonest response. --Tagishsimon (talk) 23:20, 4 February 2020 (UTC)

────────────────────────────────────────────────────────────────────────────────────────────────────

Lea, I'm not seeing anything unproductive, and certainly not "ranting". You asked about workflows; how about this use-case: yesterday, I taught a class of 21 data journalism students (under- and post-graduate) about Wikidata. My method was to run a query showing public art in the host city, which had one result. I then taught them to edit Wikidata, and had them create items about all the local public art. My plan then was to run the query again to show the impact of their work, and then extend it to analyse creations over time, the gender and alma mater of the artists, the gender of the subjects, etc., and to identify missing data statements which they would then be tasked with adding. When I ran the query again, there was still just one result. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 09:15, 5 February 2020 (UTC)

Note that the lag has finally gone back down to a reasonable level today (Wednesday). Times when the "maxlag" value is below 5 (when bots are allowed to run) seem to be up to 25-30%, which is tolerable. Clearly there's still more demand for edits than the servers can handle though.... ArthurPSmith (talk) 21:37, 5 February 2020 (UTC)
Nothing changed. Script-running users and bots ignoring „maxlag“ will win this „race“. --Succu (talk) 22:39, 5 February 2020 (UTC)

Hello all, here's an update email from the team taking care of the server side of the WDQS. Lea Lacroix (WMDE) (talk) 13:36, 7 February 2020 (UTC)

Hi folks, I have 85 batches on QuickStatements waiting to insert 1.7M claims of 2 triples each. The speed is 30s *per statement*. I've read the above but don't know whether QuickStatements respects maxlag, and can't figure out whether a speed-up can be expected soon. More details at https://m.wikidata.org/w/index.php?title=Topic:Vfnr6v8lfqagpi7g . Thanks for any help! --Vladimir Alexiev (talk) 19:20, 8 February 2020 (UTC)

  • To compute the impact, you'd need to add the number of triples already available on each item. Is it 2× or 1× the existing triples? --- Jura 19:25, 8 February 2020 (UTC)

Hello, today we increased the factor connecting maxlag to the WDQS lag, hoping that it will make the situation a bit easier for tool developers (phab:T244722). If you encounter further issues, please let me know. Lea Lacroix (WMDE) (talk) 15:46, 12 February 2020 (UTC)
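To make the mechanics of that change concrete: the WDQS lag does not enter maxlag directly; the lag across the pooled servers is divided by a configured factor first, so raising the factor makes a given query-service lag count for less. A sketch of the relation in Python, assuming the median-of-servers behaviour ArthurPSmith describes below (the actual factor value is in phab:T244722):

from statistics import median

def wdqs_maxlag_contribution(server_lags_seconds, factor):
    # Median lag across the pooled WDQS servers, scaled down by the
    # configured factor before being compared against a bot's maxlag.
    return median(server_lags_seconds) / factor

# e.g. a median lag of 3600 s with factor 60 reports as 60 s of "lag";
# with a larger factor, the same backlog trips a bot's maxlag=5 less often.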

  • Just curious about the traffic on the query server .. do we have some idea about its nature? Given the number of active contributors on Wikidata, I somehow doubt they generate that much load (even indirectly). Even occasional researchers might not generate that much traffic. If the bulk of the traffic is from commercial users, I find it somewhat odd that the WMF would subsidize their activity. --- Jura 10:18, 16 February 2020 (UTC)
  • @Lea Lacroix (WMDE): The pattern of behavior in WDQS lag for the last few days has been very peculiar. Can you give us an update: has something been changed on the development side, or are changes in the pattern of updates and queries behind this? What I noticed on Wednesday was this pattern, where ALL 7 servers seemed to have almost the same lag, following the usual "sawtooth" together in sync. But then on Thursday this pattern, where the wdqs2* servers all retain pretty short lags and stay close together, while the wdqs1* servers split widely apart: wdqs1004, with the shortest lag, governing the "maxlag" parameter (as the median value); wdqs1005 and 1006 growing but somewhat catching up; and wdqs1007 just growing - it eventually reached almost 2 hours of lag, but then turned around, and instead wdqs1005 took over as the slow one; it now has almost 2 hours of lag. It feels as if on Wednesday the load balancers were working correctly to spread query load evenly, but somehow that broke down again on Thursday? ArthurPSmith (talk) 14:08, 28 February 2020 (UTC)
  • @ArthurPSmith: honestly, we don't have a great understanding of the load patterns and the related impact of query load vs edit load on the lag. There are clearly multiple factors that affect the lag, and they interact in complex ways. We are currently rewriting the updater with a different architecture (with an event-streaming-like approach), which will completely change the interaction. So any knowledge that we could gain by analyzing the current patterns is unlikely to apply to the future situation. In short, we don't know and we're not looking into it at the moment. --GLederrey (WMF) (talk) 16:25, 28 February 2020 (UTC)
  • @GLederrey (WMF): Thanks for the update - so unless it was some other development group messing around with load balancing or something, it sounds like these pattern changes are caused by usage changes... It seems strange to me, but maybe if the load balancing is geographical and query volume moves around from one part of the world to another, that could explain it? ArthurPSmith (talk) 17:13, 28 February 2020 (UTC)

No Retry-After header with Error code 429

Hello,

I am sometimes faced with error code 429. I was expecting to find a Retry-After header in the response, but there wasn't one.

Is there supposed to be one? How can I check whether the issue comes from my side or from the server?

Thanks! Theondrus (talk) 16:42, 3 March 2020 (UTC)

It is not entirely clear what service you are calling when getting 429 errors. I'm assuming this is about WDQS (https://query.wikidata.org/). The internal throttling of WDQS should always return a Retry-After header. But other parts of our traffic infrastructure might generate 429 responses in some cases, and might or might not add a Retry-After header. In general you should treat Retry-After as a hint which might or might not be present. --GLederrey (WMF) (talk) 19:43, 3 March 2020 (UTC)
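In client code, that means treating the header as optional and falling back to your own backoff when it is absent. A minimal sketch in Python - the attempt count and fallback delay are arbitrary choices, not anything the service mandates:

import time
import requests

ENDPOINT = "https://query.wikidata.org/sparql"

def run_query(query, max_attempts=5, fallback_delay=10):
    # Run a SPARQL query, backing off on HTTP 429; Retry-After is a hint only.
    for attempt in range(max_attempts):
        response = requests.get(ENDPOINT, params={"query": query, "format": "json"})
        if response.status_code != 429:
            response.raise_for_status()
            return response.json()
        retry_after = response.headers.get("Retry-After")
        # Use the hint when present (WDQS sends seconds), else back off linearly.
        delay = int(retry_after) if retry_after else fallback_delay * (attempt + 1)
        time.sleep(delay)
    raise RuntimeError("still throttled after %d attempts" % max_attempts)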

Explain a query

I am trying to get query analysis as described in mw:Wikidata Query Service/User Manual#Explain Query, but I cannot get it to work for all queries. When I try a very simple query, like one with only a single triple, I get a reply with an analysis of the query, but only after about 30 seconds. For complicated queries, I most often get an empty reply. That is, I get HTTP headers, but no content is received before the connection is closed after 30-40 seconds. The headers look like this (without my IP address):

HTTP/2 200 
server: nginx/1.13.6
date: Thu, 05 Mar 2020 14:10:28 GMT
content-type: text/html
x-first-solution-millis: 29
x-served-by: wdqs1004
access-control-allow-origin: *
cache-control: public, max-age=300
x-envoy-upstream-service-time: 170
x-ats-timestamp: 1583417428
vary: Accept, Accept-Encoding
x-varnish: 129232485
age: 70
x-cache: cp3064 miss, cp3060 pass
x-cache-status: pass
server-timing: cache;desc="pass"
strict-transport-security: max-age=106384710; includeSubDomains; preload
set-cookie: WMF-Last-Access=05-Mar-2020;Path=/;HttpOnly;secure;Expires=Mon, 06 Apr 2020 12:00:00 GMT
set-cookie: WMF-Last-Access-Global=05-Mar-2020;Path=/;Domain=.wikidata.org;HttpOnly;secure;Expires=Mon, 06 Apr 2020 12:00:00 GMT
x-client-ip: xxx.xxx.xxx.xxx
accept-ranges: bytes

I make the query as a POST request with urlencoded content using curl in a console on a Linux PC. If the query is in the file query.sparql the command line is:

curl --data-urlencode query@query.sparql --data-urlencode explain=details https://query.wikidata.org/sparql

Can you give any advice for getting it to work better? Should the query method be changed? --Dipsacus fullonum (talk) 15:07, 5 March 2020 (UTC)

@Dipsacus fullonum: I don’t use the explain feature very often, but as far as I’m aware, it first runs the query and then gives you the report (unlike an SQL EXPLAIN). So my best guess is that your complicated queries time out, and the connection closes after the usual 60 s timeout? (Though that would be longer than the 30-40 seconds you’re reporting.) And the delay for the simple query could be because it’s a query that returns a lot of triples, like SELECT * { ?x wdt:P31 wd:Q5. }. For what it’s worth, with ASK { wd:Q42 wdt:P31 wd:Q5. }, I reliably get a reply within half a second. --Lucas Werkmeister (WMDE) (talk) 15:36, 5 March 2020 (UTC)
Thanks for the reply, Lucas Werkmeister (WMDE). The connection is definitely closed before 60 seconds, with no data sent besides the headers. I don't understand why the query should be run: the explanation (when present) doesn't contain any results from running the query. And when explain is used to try to figure out how to improve a query which times out, it is in fact counterproductive to also try to run it ... --Dipsacus fullonum (talk) 16:04, 5 March 2020 (UTC)
@Lucas Werkmeister (WMDE): I did some more experiments, and it seems a good explanation that the query is run (even though I don't know why), as I get fast results when trying to explain simple queries that run fast, like "SELECT ?what { wd:Q42 wdt:P31 ?what. }", but no reply for simple queries that time out, like "SELECT ?what { ?who wdt:P31 ?what. }". The connection is closed after about 35 seconds for this query (which also times out when run):
$ cat query.sparql 
SELECT ?muni ?key ?muniLabel ?instanceLabel
WHERE {
  VALUES ?instance {wd:Q42744322} .
  ?muni wdt:P439 ?key .
  ?muni p:P31 ?instanceBlock . 
  ?instanceBlock ps:P31 ?instance .
  FILTER NOT EXISTS {?instanceBlock pq:P582 ?end} .
  FILTER NOT EXISTS {
    ?muni p:P1082 ?populationBlock .
    ?populationBlock ps:P1082 ?population .
    ?populationBlock wikibase:rank wikibase:PreferredRank .
    ?populationBlock pq:P585 "2019-09-30"^^xsd:dateTime
  }
  SERVICE wikibase:label{bd:serviceParam wikibase:language "de"}
}
ORDER BY ?key
$ time curl --data-urlencode query@query.sparql --data-urlencode explain=details https://query.wikidata.org/sparql -s > query.explain.html

real	0m35,657s
user	0m0,025s
sys	0m0,008s
$ wc query.explain.html 
0 0 0 query.explain.html
--Dipsacus fullonum (talk) 16:42, 5 March 2020 (UTC)