
Wikidata:SPARQL query service/query optimization


Query optimization attempts to find the most efficient way to execute a given query from among the different possible query execution plans. Blazegraph has a built-in query optimizer which often works well. However, sometimes it is not so successful; in such cases queries may need to be optimised manually.


Optimization strategies

 

Fixed values and ranges

Searching for fixed values, e.g. wdt:P31 wd:Q5, is the cheapest option in the query service, but searching for ranges of values is also usually pretty efficient, and often preferable to other options. For example, to look for people born in 1978, FILTER(YEAR(?dateOfBirth) = 1978) is not nearly as efficient as the following:

SELECT ?item ?dateOfBirth WHERE {
  ?item wdt:P31 wd:Q5;
        wdt:P569 ?dateOfBirth.
  FILTER("1978-00-00"^^xsd:dateTime <= ?dateOfBirth &&
         ?dateOfBirth < "1979-00-00"^^xsd:dateTime)
}

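For comparison, this is what the YEAR()-based version looks like; the filter has to compute YEAR() for every date of birth before any can be discarded, so expect it to be much slower or to time out:

```sparql
# Slower variant: YEAR() must be evaluated for every ?dateOfBirth value
SELECT ?item ?dateOfBirth WHERE {
  ?item wdt:P31 wd:Q5;
        wdt:P569 ?dateOfBirth.
  FILTER(YEAR(?dateOfBirth) = 1978)
}
```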

You can further optimize this by informing the query service that wdt:P569 is a range-safe predicate:

SELECT ?item ?dateOfBirth WHERE {
  ?item wdt:P31 wd:Q5;
        wdt:P569 ?dateOfBirth. hint:Prior hint:rangeSafe true.
  FILTER("1978-00-00"^^xsd:dateTime <= ?dateOfBirth &&
         ?dateOfBirth < "1979-00-00"^^xsd:dateTime)
}


This tells the optimizer that ?dateOfBirth doesn’t mix dates, strings, integers or other data types, which simplifies the range comparison. This is almost always true on Wikidata, so the main reason not to use this optimizer hint all the time is that it’s a bit inconvenient. (It’s probably also not a good idea to use it when you’re working with unknown values.)

Property paths

The optimizer generally assumes property paths to be more expensive than simple value matches, and usually that’s correct: try to add some simple triples to reduce the number of values that have to be checked against the path. However, sometimes paths can be fairly efficient, and it’s good to inform the optimizer of that fact. For example, consider the following query for all municipalities in the German state of Lower Saxony:

SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31/wdt:P279* wd:Q262166;
        wdt:P131+ wd:Q1197.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}


As written, this will time out because the query service will attempt to start with all instances of (P31) municipality of Germany (Q262166) (including subclasses) before limiting them to those located in the administrative territorial entity (P131) Lower Saxony (Q1197) (recursively). It’s much more efficient to do it the other way around:

SELECT ?item ?itemLabel WHERE {
  hint:Query hint:optimizer "None".
  ?item wdt:P131+ wd:Q1197;
        wdt:P31/wdt:P279* wd:Q262166.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}


You can also tell the optimizer in which direction to traverse a path. For example, say you want to test whether Moby-Dick (Q174596) is a creative work (Q17537576):

ASK {
  wd:Q174596 wdt:P31/wdt:P279* wd:Q17537576.
}


This takes several seconds because the optimizer decides to start at creative work (Q17537576) and walk the subclass of (P279) links backwards. You can tell it to go the other way around, starting at Moby-Dick (Q174596), following one instance of (P31) link and then walking the subclass of (P279) link forwards:

ASK {
  wd:Q174596 wdt:P31/wdt:P279* wd:Q17537576.
  hint:Prior hint:gearing "forward".
}


This is much more efficient. To go the other way, use "reverse" instead of "forward".
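As an untested sketch, "reverse" gearing is the natural choice when the bound end of the path is the object – for example, enumerating all direct and indirect subclasses of creative work (Q17537576) by walking the subclass of (P279) links backwards from it:

```sparql
# Sketch: start at wd:Q17537576 and walk P279 links in reverse
SELECT ?class WHERE {
  ?class wdt:P279* wd:Q17537576.
  hint:Prior hint:gearing "reverse".
}
```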

Order tweaking

We’ve already disabled the optimizer a few times, when it tried to rearrange the query the wrong way. You can also apply more fine-grained control, informing the optimizer which joins it should run first or last, using the hints hint:Prior hint:runFirst true or hint:Prior hint:runLast true. One important thing to keep in mind is that these control the first or last join – a join corresponds roughly to the . between two triples, not to a triple directly. In particular, you can’t put this hint after the first triple in a group, nor (I think) after a triple containing a property path. See also the sections on named subqueries below for further options to control query execution order.
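As a sketch of the placement rule (reusing the presidents-and-spouses query that appears later on this page), the hint annotates the join introduced by the triple before it, so it can go after the second triple of a group but not the first:

```sparql
# Sketch: ask the optimizer to run the P39 join first
SELECT ?pres ?spouse WHERE {
  ?pres wdt:P31 wd:Q5.
  ?pres wdt:P39 wd:Q11696. hint:Prior hint:runFirst true.
  ?pres wdt:P26 ?spouse.
}
```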

Label service

The label service (SERVICE wikibase:label { }) can make queries much less efficient – when working with an expensive query, it’s often a good idea to disable the service at first, optimize the query as much as possible, and only then turn it back on. With some luck, the query will by then be efficient enough that the label service doesn’t cause it to time out.

If the query works without the label service but results in a timeout when it is added, it can help to extract the rest of the query into a subquery, and only apply the label service at the very end. You can do this with a regular SPARQL subquery:

SELECT ?item ?itemLabel  WHERE {
  {
    SELECT ?item  WHERE {
      
    }
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}


This can be particularly important if you are using LIMIT to get back a few quick results from a query that would otherwise be very expensive.

The label service tries to run last, which means Blazegraph tries to materialise all the results of a query before moving on to adding the labels, only then applying the LIMIT directive. This can be avoided if the expensive part of the query is put into a subquery, as above, and the LIMIT is placed on the subquery, so the labels are looked up only for those few results.
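For example (a sketch, using humans as a stand-in for the expensive part of the query), placing the LIMIT on the subquery means labels are looked up for only ten items:

```sparql
SELECT ?item ?itemLabel WHERE {
  {
    SELECT ?item WHERE {
      ?item wdt:P31 wd:Q5.   # the expensive part of the query goes here
    }
    LIMIT 10                 # materialise only ten results...
  }
  # ...so the label service only ever sees those ten
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
```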

Named subqueries

As mentioned above, the optimizer sometimes tries to optimize the whole query across subqueries, so putting part of the query into a subquery doesn’t always help. You can fix this by using a Blazegraph extension called named subqueries:

SELECT ?item ?itemLabel 

WITH {
  SELECT ?item  WHERE {
     
  }
} AS %results 

WHERE {
  INCLUDE %results.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}


Named subqueries are guaranteed to run only once, on their own, which is just what we want here.

Further uses of named subqueries

The above example only uses one named subquery, but you can use any number of them. This can be useful to impose a general order or structure on the query, while still letting the optimizer rearrange the instructions within each subquery as it knows best. You can also disable the optimizer only in one named subquery, while still letting it optimize the rest of the query: for this, specify hint:SubQuery hint:optimizer "None" in that subquery instead of the usual hint:Query hint:optimizer "None". See User:TweetsFactsAndQueries/Queries/name phrases for an example using this technique.
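As a sketch, the Lower Saxony query from the property-paths section could be wrapped this way, with the optimizer disabled only inside the named subquery:

```sparql
SELECT ?item ?itemLabel
WITH {
  SELECT ?item WHERE {
    hint:SubQuery hint:optimizer "None".   # hand-ordering applies only in here
    ?item wdt:P131+ wd:Q1197;
          wdt:P31/wdt:P279* wd:Q262166.
  }
} AS %results
WHERE {
  INCLUDE %results.
  # the optimizer remains free to rearrange this outer part
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
```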

Named subqueries are also useful when you want to group results multiple times, or want to use the same subquery results more than once in different contexts – you can INCLUDE the same named subquery into multiple other ones, if necessary (and it will be run only once). See User:TweetsFactsAndQueries/Queries/most common election days for an example of this.

Automatic optimisation: background overview

The basic approach to query optimisation is to try to get the solution set to be as small as possible as soon as possible in the query execution, to make the work needed for subsequent joins as small as it can be. A subsidiary goal can be to maximise the possibility of pipelining and parallelisation. This is what Blazegraph's built-in query optimiser tries to achieve.

For example, here is a query to return a list of U.S. Presidents and their spouses:

SELECT ?pres ?presLabel ?spouse ?spouseLabel WHERE {
   ?pres wdt:P31 wd:Q5 .
   ?pres wdt:P39 wd:Q11696 .
   ?pres wdt:P26 ?spouse .
   SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
   }
 }


The details of how Blazegraph has executed a particular query (and why) can be obtained via the query engine's EXPLAIN mode. This can be done by replacing the following at the start of the query-editor URL,

https://query.wikidata.org/#

with

https://query.wikidata.org/bigdata/namespace/wdq/sparql?explain&query=

to send the query to the SPARQL endpoint directly (not via the query editor), with an extra explain& in the URL before the query= string.

The resulting URL, sending the query above to the endpoint directly, with the 'explain' option, produces this report on the query and its optimisation.

By default, a query would execute from top to bottom, with solution-set joins made in the order given. This is shown as the "Original AST" plan in the report. However, the query engine estimates (as of September 2015) how many solutions each triple pattern would match on its own, and finds that far fewer items have held the position of U.S. President than are human.

It therefore concludes that the most efficient way to run the query, rather than the order specified, is instead to first find the presidents (76), then see how many of them have spouses (51 solutions), then see how many are human (47 solutions), before sending this solution set to the labelling service. This rearrangement is typical of successful query optimisation.
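That chosen order corresponds roughly to the following hand-ordered form (a sketch for illustration only; since the optimizer handles this query well, writing it out by hand is unnecessary):

```sparql
SELECT ?pres ?spouse WHERE {
  hint:Query hint:optimizer "None".
  ?pres wdt:P39 wd:Q11696.   # presidents first: the smallest set
  ?pres wdt:P26 ?spouse.     # then restrict to those with spouses
  ?pres wdt:P31 wd:Q5.       # then the (barely selective) humanity check
}
```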

A query that has difficulties

The following query attempts to find all statements referenced to the Le Figaro website, and return their subject, predicate and object. However as written it times out.

prefix prov: <http://www.w3.org/ns/prov#>
prefix pr: <http://www.wikidata.org/prop/reference/>
SELECT ?statement ?subject ?subjectLabel ?property ?propertyLabel ?object ?objectLabel ?refURL WHERE {
  ?statement prov:wasDerivedFrom ?ref .
  ?ref pr:P854 ?refURL 
  FILTER (CONTAINS(str(?refURL),'lefigaro.fr')) .       
  ?subject ?p ?statement .
  ?property wikibase:claim ?p .
  ?property wikibase:statementProperty ?ps .
  ?statement ?ps ?object
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}


The reason for this can be investigated by looking at the explain report for the query.

It appears that the query optimiser is seduced by the low cardinality of the patterns ?property wikibase:claim ?p and ?property wikibase:statementProperty ?ps (only 1756 properties take items as their objects). It does not anticipate that (i) the second pattern will do nothing to reduce the solution set from the first; and (ii) the two together will do very little to reduce the cardinality of the pattern ?subject ?p ?statement (764,223,907), which the optimiser proposes to join before looking at ?statement prov:wasDerivedFrom ?ref and then ?ref pr:P854 ?refURL.

The query optimiser may also be trying to defer testing the statement FILTER (CONTAINS(str(?refURL),'lefigaro.fr')) for as long as possible, as this requires materialisation of actual string values, rather than proceeding solely with joins based on whether items exist in statements or not.

(Any examples of a hand-ordered version of this query?)
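Following the analysis above, one possible hand-ordered version (an untested sketch) disables the optimizer and starts from the reference URLs, so that the Le Figaro filter is applied as early as possible:

```sparql
prefix prov: <http://www.w3.org/ns/prov#>
prefix pr: <http://www.wikidata.org/prop/reference/>
SELECT ?statement ?subject ?property ?object ?refURL WHERE {
  hint:Query hint:optimizer "None".
  ?ref pr:P854 ?refURL.                          # references with URLs first
  FILTER (CONTAINS(str(?refURL), 'lefigaro.fr')) # filter them down early
  ?statement prov:wasDerivedFrom ?ref.           # then the referenced statements
  ?subject ?p ?statement.
  ?property wikibase:claim ?p.
  ?property wikibase:statementProperty ?ps.
  ?statement ?ps ?object.
}
```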

Other optimisation modes

Blazegraph offers a handful of other optimisation directives, including the more expensive "runtime optimisation", which uses sampling to estimate more precisely how the size of the solution set is likely to develop under a particular sequence of joins; and a directive to run a particular join first.

However, examining the corresponding query report, it appears that runtime optimisation has proceeded with the same query sequence as the static optimisation, and similarly times out.

As for the "runFirst" directive, I have not yet been able to work out how to apply it.

Other queries the optimiser has difficulty with

Performance falls off a cliff when the number in a group of interest exceeds a certain threshold

This arose in connection with a query to find humans with apparently matching dates of birth and death, restricted to a particular country:

  • France (N=125,223): tinyurl.com/obxk9av -- query report (successful)
  • Germany (N=194,098): tinyurl.com/pwthc9d -- query report (times out)

What goes wrong here occurs when the number of items with the given value of country of citizenship (P27) – e.g. 194,098 for Germany – exceeds the number of items (time-value nodes, in fact) with wikibase:timePrecision "11"^^xsd:integer (169,902).

For France, with N=125,223, there is a steady fall-off in the size of solution set: requiring a date of death, then day-specific precision, then a date of birth, then day-specific precision, knocks this number down to 48,537, which is then reduced to 10 by requiring the date of birth and the date of death to match (a very quick operation).

However for Germany, with N=194,098, the query optimiser attempts to start with the 169,902 date-nodes with day-specific dates; but the query then explodes when the system tries to match these to all the people with such day-specific birthdates (1,560,806), a number which reduces when the nationality filter kicks in; but which still eats up the available query time.

(In fact, having looked at the performance data and seen just how fast Blazegraph can do all-against-all matching for very big sets, it turns out that, with a query rewrite, the system is fast enough to run this kind of query for everything against everything -- see tinyurl.com/pn9xyye -- by matching on date-nodes first, which knocks down the size of the solution set, and only afterwards checking the requirement that the date be day-specific.)

Adding a names and name-labels wrapper to a query for life events makes it time out

Adding an apparently trivial wrapper, to find the name and name-label, causes the query in the immediately preceding section to time out, even though without the wrapper it ran with no difficulty.

  • Query as presented: tinyurl.com/p8wzb74 -- query report
  • Hand-ordered query: tinyurl.com/o7z95vb -- query report (successful)
  • Query without the name lookup: tinyurl.com/obxk9av -- query report (successful)

With the outer layer in place, the inner layer now starts (even for France) by extracting the birth-nodes that are connected to birth-dates, and then keeping those which have precision=11.

The situation can be fixed by turning off the query optimiser and hand-ordering the query.

It seems this may be a general feature of queries with sub-select clauses: both here and in the next query, the optimiser seems to want to start the sub-query with a requirement that includes one of the variables that will be projected to the outer layer.

Adding a property-labels wrapper to a query for properties makes it time out

  • Query as presented: tinyurl.com/oex59qr -- query report
  • Hand-ordered query: tinyurl.com/ppkmb3v -- query report (successful)
  • Query without property-label lookup: tinyurl.com/qe4ukbx -- query report (successful)

Same as the above: for the query including property look-ups, the optimiser apparently wants to start with all triples ?a ?p ?b and then join ?b wdt:P31 wd:Q4167410, rather than the other way round (which is what it manages to successfully do for the smaller query with no property look-up).

This doesn't appear to be an artifact caused by BIND, nor of the order in which the clauses are presented -- even if these are changed, the query still runs slowly with the built-in optimiser:

  • Alt query: tinyurl.com/q97a8sh -- query report (still fails)

It is still preferring to start with the assertion ?a ?p ?b, even though this is less specific -- apparently because ?p is one of the variables projected out of the inner SELECT statement.

More... ?

Add further cases here