Wikidata:SPARQL query service/query optimization

Query optimization attempts to find the most efficient way to execute a given query from among the different possible query execution plans. Blazegraph has a built-in query optimizer which often works well. However, sometimes it is not so successful; in such cases queries may need to be optimised manually.

Optimization strategies


Fixed values and ranges

Searching for fixed values, e.g. wdt:P31 wd:Q5, is the cheapest option in the query service, but searching for ranges of values is also usually quite efficient, and often preferable to other options. For example, to look for people born in 1978, FILTER(YEAR(?dateOfBirth) = 1978) is not nearly as efficient as the following:

SELECT ?item ?dateOfBirth WHERE {
  ?item wdt:P31 wd:Q5;
        wdt:P569 ?dateOfBirth.
  FILTER("1978-00-00"^^xsd:dateTime <= ?dateOfBirth &&
         ?dateOfBirth < "1979-00-00"^^xsd:dateTime)
}
Try it!

You can further optimize this by informing the query service that wdt:P569 is a range-safe predicate:

SELECT ?item ?dateOfBirth WHERE {
  ?item wdt:P31 wd:Q5;
        wdt:P569 ?dateOfBirth. hint:Prior hint:rangeSafe true.
  FILTER("1978-00-00"^^xsd:dateTime <= ?dateOfBirth &&
         ?dateOfBirth < "1979-00-00"^^xsd:dateTime)
}
Try it!

This tells the optimizer that ?dateOfBirth doesn’t mix dates, strings, integers or other data types, which simplifies the range comparison. This is almost always true on Wikidata, so the main reason not to use this optimizer hint all the time is that it’s a bit inconvenient. (It’s probably also not a good idea to use it when you’re working with unknown values.)

Property paths

The optimizer generally assumes property paths to be more expensive than simple value matches, and usually that’s correct: try to add some simple triples to reduce the number of values that have to be checked against the path. However, sometimes paths can be fairly efficient, and it’s good to inform the optimizer of that fact. For example, consider the following query for all municipalities in the German state of Lower Saxony:

SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31/wdt:P279* wd:Q262166;
        wdt:P131+ wd:Q1197.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
Try it!

As written, this will time out because the query service will attempt to start with all instances of (P31) municipality in Germany (Q262166) (including subclasses) before limiting them to those located in the administrative territorial entity (P131) Lower Saxony (Q1197) (recursively). It’s much more efficient to do it the other way around:

SELECT ?item ?itemLabel WHERE {
  hint:Query hint:optimizer "None".
  ?item wdt:P131+ wd:Q1197;
        wdt:P31/wdt:P279* wd:Q262166.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
Try it!

You can also tell the optimizer in which direction to traverse a path. For example, say you want to test whether Moby-Dick (Q174596) is a creative work (Q17537576):

ASK {
  wd:Q174596 wdt:P31/wdt:P279* wd:Q17537576.
}
Try it!

This takes several seconds because the optimizer decides to start at creative work (Q17537576) and walk the subclass of (P279) links backwards. You can tell it to go the other way around, starting at Moby-Dick (Q174596), following one instance of (P31) link and then walking the subclass of (P279) link forwards:

ASK {
  wd:Q174596 wdt:P31/wdt:P279* wd:Q17537576.
  hint:Prior hint:gearing "forward".
}
Try it!

This is much more efficient. To go the other way, use "reverse" instead of "forward".

Inverse property paths

Another technique for steering the optimizer toward delivering results is to invert the property path used in the query; this can be particularly useful when reports with wdt:P31/wdt:P279* constructions time out. Consider the following examples - the first times out, the second runs quickly:

# Museums in Northern Ireland - TIMES OUT
SELECT distinct ?museum ?museumLabel WHERE {
  ?museum wdt:P131* wd:Q26 .
  ?museum wdt:P31/wdt:P279* wd:Q33506 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
Try it!
# Museums in Northern Ireland - SUCCEEDS
SELECT distinct ?museum ?museumLabel WHERE {
  ?museum wdt:P131* wd:Q26 .
  wd:Q33506 ^wdt:P279*/^wdt:P31 ?museum .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
Try it!

More details on property paths can be found in the SPARQL standard.

Order tweaking

We’ve already disabled the optimizer a few times, when it tried to rearrange the query the wrong way. You can also apply some more fine-grained control to the optimizer by informing it which joins it should run first or last, using the hints hint:Prior hint:runFirst true or hint:Prior hint:runLast true.

One important thing to keep in mind is that these control the first or last join, which is more or less the . between two triples – not a triple directly. In particular, you can’t put this hint after the first triple in a group, nor (I think) after a triple containing a property path. See also the sections on named subqueries below for further options to control query execution order.
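For example, a minimal sketch of the placement (the triples here are purely illustrative): to ask for the occupation join to run first, attach the hint after the second triple:

SELECT ?item ?dateOfBirth WHERE {
  ?item wdt:P31 wd:Q5.
  ?item wdt:P106 wd:Q901. hint:Prior hint:runFirst true.  # run the occupation join first
  ?item wdt:P569 ?dateOfBirth.
}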

Delay call to label service

The label service (SERVICE wikibase:label { }) can make queries much less efficient – when working with an expensive query, it’s often a good idea to disable the service at first, optimize the query as much as possible, and only then turn it back on. With some luck, the query will then be efficient enough that the label service doesn’t cause it to time out.

If the query works without the label service but results in a timeout when it is added, it can help to extract the bulk of the query into a subquery, and only apply the label service at the very end. You can do this with a regular SPARQL subquery:

SELECT ?foo ?fooLabel ?bar ?barLabel  WHERE {
  { # Wrapper for label-less subquery
    SELECT ?foo ?bar  WHERE {
      # main query body goes here, without labels
    }
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
Try it!

This can be particularly important if you are using LIMIT to get back a few quick results from a query that would otherwise be very expensive.

The label service tries to run last, which means Blazegraph tries to materialise all the results of a query before moving on to adding the labels, and only then applies the LIMIT directive. This can be avoided if the expensive part of the query is put into a subquery, as above, with the LIMIT placed on the subquery, so that labels are looked up only for those few results.
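A sketch of this pattern, with a cheap triple standing in for the expensive query body:

SELECT ?item ?itemLabel WHERE {
  { # only these 10 results get labelled
    SELECT ?item WHERE {
      ?item wdt:P31 wd:Q5.  # stands in for the expensive part of the query
    }
    LIMIT 10
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}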

Another way to reduce usage of the label service is to use rdfs:label directly. To get all proteins and their labels:

SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q8054 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" . }
}
Try it!

will time out, because the roughly one million results exceed the limit of the label service. Instead, get all proteins that have labels:

SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q8054 .
  ?item rdfs:label ?itemLabel. FILTER( LANG(?itemLabel)="en" )
}
Try it!

Note that this will only output items that have an English label.
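If items without an English label should still appear in the results, a variant (a sketch, not guaranteed to be faster) wraps the label lookup in OPTIONAL:

SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q8054 .
  OPTIONAL { ?item rdfs:label ?itemLabel. FILTER( LANG(?itemLabel)="en" ) }  # label stays unbound if missing
}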

Named subqueries

As mentioned below, the optimizer sometimes tries to optimize the whole query across subqueries, so putting part of the query into a subquery doesn’t always help. You can fix this by using a Blazegraph extension called named subqueries:

SELECT ?item ?itemLabel 

WITH {
  SELECT ?item  WHERE {
     # main query body goes here
  }
} AS %results 

WHERE {
  INCLUDE %results.
     
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
Try it!

Named subqueries are guaranteed to run only once, on their own, which is just what we want here.

Further uses of named subqueries

The above example only uses one named subquery, but you can use any number of them. This can be useful to impose a general order or structure on the query, while still letting the optimizer rearrange the instructions within each subquery as it knows best.

You can also disable the optimizer only in one named subquery, while still letting it optimize the rest of the query: for this, specify hint:SubQuery hint:optimizer "None" in that subquery instead of the usual hint:Query hint:optimizer "None".

See User:TweetsFactsAndQueries/Queries/name phrases for an example using this technique.
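In outline, the shape is as follows (a sketch; the triples are illustrative):

SELECT ?item ?dateOfBirth WITH {
  SELECT ?item ?dateOfBirth WHERE {
    hint:SubQuery hint:optimizer "None".  # hand-ordered subquery; rest of the query is still optimized
    ?item wdt:P106 wd:Q901.
    ?item wdt:P569 ?dateOfBirth.
  }
} AS %results
WHERE {
  INCLUDE %results.
}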

Named subqueries are also useful when you want to group results multiple times, or want to use the same subquery results more than once in different contexts – you can INCLUDE the same named subquery into multiple other ones, if necessary (and it will be run only once). See User:TweetsFactsAndQueries/Queries/most common election days for an example of this.
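For instance, one named subquery can be INCLUDEd into two different branches (a minimal sketch; the properties are illustrative):

SELECT ?item ?dateOfBirth ?dateOfDeath WITH {
  SELECT ?item WHERE {
    ?item wdt:P39 wd:Q11696.  # a small starting set, computed once
  }
} AS %set
WHERE {
  { INCLUDE %set. ?item wdt:P569 ?dateOfBirth. }  # first use
  UNION
  { INCLUDE %set. ?item wdt:P570 ?dateOfDeath. }  # second use
}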

Searching labels

If you want to find all items that have a certain word, say "Frankfurt" in a label, you could try a query like this:

SELECT ?item ?label
WHERE
{
  ?item rdfs:label ?label.
  FILTER CONTAINS(?label, "Frankfurt")
}
Try it!

But that query will time out, because the query engine would have to read hundreds of millions of labels to find the word. Wikidata's normal search function, however, has all the words in labels indexed and can make the search very fast. A query can access Wikidata's search function through the MediaWiki API query service. The query below also finds items with the word "Frankfurt" in a label, but uses the MediaWiki API and can therefore run without a timeout:

SELECT ?item ?label
WHERE
{
  SERVICE wikibase:mwapi
  {
    bd:serviceParam wikibase:endpoint "www.wikidata.org";
                    wikibase:api "Generator";
                    mwapi:generator "search";
                    mwapi:gsrsearch "inlabel:Frankfurt";
                    mwapi:gsrlimit "max".
    ?item wikibase:apiOutputItem mwapi:title.
  }
  ?item rdfs:label ?label.
  FILTER CONTAINS(?label, "Frankfurt")
}
Try it!

The MWAPI search can also be restricted to labels in selected languages, and can search statements, qualifiers and more. Please see mw:Help:Extension:WikibaseCirrusSearch for a description of the search options.
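For instance, a sketch restricting the search above to German labels, using the @language suffix accepted by WikibaseCirrusSearch's inlabel: keyword:

SELECT ?item WHERE {
  SERVICE wikibase:mwapi
  {
    bd:serviceParam wikibase:endpoint "www.wikidata.org";
                    wikibase:api "Generator";
                    mwapi:generator "search";
                    mwapi:gsrsearch "inlabel:Frankfurt@de";
                    mwapi:gsrlimit "max".
    ?item wikibase:apiOutputItem mwapi:title.
  }
}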

Use COUNT(*) when possible, and fast range counts

If you want to know the number of persons in Wikidata you might make this query:

SELECT (COUNT(?item) AS ?count)
{
  ?item wdt:P31 wd:Q5 .
}
Try it!

It looks simple but takes a long time to run. The SPARQL engine must check whether ?item is bound and has a non-error value over 8 million times, so the query typically takes 20–30 seconds. But we know that ?item is always bound and that the values are always items, so it is better to just count the number of results without conditions, using this query:

SELECT (COUNT(*) AS ?count)
{
  ?item wdt:P31 wd:Q5 .
}
Try it!

Here the engine doesn't have to examine each result to make the count, and the query takes less than a second to run.

While counting the number of results is generally faster than counting a specific binding, the reason this example is particularly fast is that the optimizer rewrites a count over a single triple pattern to use fast range counts. This optimization will not fire if you count over anything more than a single triple pattern.

Distinct term scan, and group by and count optimization

If you try to list unique predicates like so, the operation will time out.

SELECT ?p
WHERE { [] ?p [] . }
GROUP BY ?p
Try it!

However, if you alter this query slightly, it will be rewritten by the optimizer to use a distinct term scan, which is quite fast. This only works on a single triple pattern.

SELECT DISTINCT ?p
WHERE { [] ?p [] . }
Try it!

There is also a simple group-by-and-count optimization that combines both the distinct term scan and the fast range count optimizations, the latter for every predicate. This is a lightning-fast way to get counts for all the different predicates. If you need counts for only a subset of the predicates, it will most likely be fastest to start with this, wrap it unchanged in a named subquery (as sketched after the next query), and then reduce the set down to what you want.

SELECT ?p (COUNT(?p) AS ?count)
WHERE { [] ?p [] . }
GROUP BY ?p
Try it!
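A sketch of the subset pattern just described, keeping the fast query unchanged in a named subquery and filtering afterwards (the wdt: filter is illustrative):

SELECT ?p ?count WITH {
  SELECT ?p (COUNT(?p) AS ?count)
  WHERE { [] ?p [] . }
  GROUP BY ?p
} AS %counts
WHERE {
  INCLUDE %counts.
  FILTER(STRSTARTS(STR(?p), STR(wdt:)))  # keep only direct-claim predicates
}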

This optimization should work for any component of the triple.

SELECT ?o (COUNT(*) as ?count)
WHERE { ?s wdt:P31 ?o }
GROUP BY ?o
Try it!


Count sizes of huge subset intersections with WikibaseCirrusSearch

Blazegraph often fails when counting the sizes of intersections of huge subsets; for example, it is impossible to count the number of humans without a gender property using SPARQL-based optimizations alone. With MWAPI and WikibaseCirrusSearch, however, this works extremely fast.

SELECT ?label ?count WHERE {
  {
    BIND("People without gender" AS ?label)
    SERVICE wikibase:mwapi {
      bd:serviceParam wikibase:endpoint "www.wikidata.org";
                      wikibase:api "Search"; wikibase:limit "once" ;
                      mwapi:srsearch "haswbstatement:P31=Q5 -haswbstatement:P21" ;
                      mwapi:srlimit "1" ; mwapi:srprop "" ; mwapi:srsort "none" ; mwapi:srnamespace "0" .
      ?count wikibase:apiOutput '//searchinfo[1]/@totalhits'.
    }
  } UNION {
    BIND("Male" AS ?label)
    SERVICE wikibase:mwapi {
      bd:serviceParam wikibase:endpoint "www.wikidata.org";
                      wikibase:api "Search"; wikibase:limit "once" ;
                      mwapi:srsearch "haswbstatement:P31=Q5 haswbstatement:P21=Q6581097" ;
                      mwapi:srlimit "1" ; mwapi:srprop "" ; mwapi:srsort "none" ; mwapi:srnamespace "0" .
      ?count wikibase:apiOutput '//searchinfo[1]/@totalhits'.
    }
  } UNION {
    BIND("Female" AS ?label)
    SERVICE wikibase:mwapi {
      bd:serviceParam wikibase:endpoint "www.wikidata.org";
                      wikibase:api "Search"; wikibase:limit "once" ;
                      mwapi:srsearch "haswbstatement:P31=Q5 haswbstatement:P21=Q6581072" ;
                      mwapi:srlimit "1" ; mwapi:srprop "" ; mwapi:srsort "none" ; mwapi:srnamespace "0" .
      ?count wikibase:apiOutput '//searchinfo[1]/@totalhits'.
    }
  } UNION {
    BIND("Other" AS ?label)
    SERVICE wikibase:mwapi {
      bd:serviceParam wikibase:endpoint "www.wikidata.org";
                      wikibase:api "Search"; wikibase:limit "once" ;
                      mwapi:srsearch "haswbstatement:P31=Q5 haswbstatement:P21 -haswbstatement:P21=Q6581097 -haswbstatement:P21=Q6581072" ;
                      mwapi:srlimit "1" ; mwapi:srprop "" ; mwapi:srsort "none" ; mwapi:srnamespace "0" .
      ?count wikibase:apiOutput '//searchinfo[1]/@totalhits'.
    }
  }
}
Try it!

Services

GAS Service

The Gather, Apply, and Scatter (GAS) service provides graph traversal, graph mining, and similar classes of algorithms for SPARQL. In practical terms, it enables a series of relations to be followed through the graph; for instance the chain of father, grandfather, great-grandfather and so on - tracing the father (P22) line - of a subject item, or a chain of connected railway stations through adjacent station (P197) statements. It is documented in the Blazegraph documentation.

In essence, GAS takes two parameters: gas:in wd:Qnnn - the node or item from which to start - and gas:linkType wdt:Pnnn - the property to be followed. It outputs gas:out - the next item; gas:out1 - the number of hops from the input item; and gas:out2 - the item immediately preceding the gas:out item in the queried chain.

# Male lineage of Elizabeth II (Q9682)
SELECT ?father ?fatherLabel ?child ?childLabel ?depth WHERE {
  SERVICE gas:service {
    gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.BFS" ;
                gas:in wd:Q9682 ;
                gas:linkType wdt:P22 ;
                gas:out ?father ;
                gas:out1 ?depth ;
                gas:out2 ?child .
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
} ORDER BY ?depth
Try it!
# UK railway stations connected from Topsham railway station (Q1990205)
#defaultView:Map{"hide":["?cds","?line","?layer"],"layer":"?trackLabel"}
SELECT ?station ?stationLabel ?cds ?line ?trackLabel ?track ?image {
  { SELECT * {
      SERVICE gas:service {
        gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.BFS" ;
                    gas:in wd:Q1990205 ;
                    gas:linkType wdt:P197 ;
                    gas:out ?station ;
                    gas:out1 ?depth ;
                    gas:out2 ?pred .
      }
      FILTER(?station != wd:Q1990205)
      ?station wdt:P625 ?cds ;
               wdt:P17 wd:Q145 .
      ?station p:P197 ?adjST .
      ?adjST ps:P197 ?adj ;
             pq:P81 ?track .
      ?adj p:P625/psv:P625/wikibase:geoLatitude ?lat1 ;
           p:P625/psv:P625/wikibase:geoLongitude ?lon1 .
      ?station p:P625/psv:P625/wikibase:geoLatitude ?lat2 ;
               p:P625/psv:P625/wikibase:geoLongitude ?lon2 .
      BIND(CONCAT('LINESTRING(', STR(?lon1), ' ', STR(?lat1), ',', STR(?lon2), ' ', STR(?lat2), ')') AS ?str)
      BIND(STRDT(?str, geo:wktLiteral) AS ?line)
  } }
  OPTIONAL { ?station wdt:P18 ?image . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
Try it!

It is possible to look at inverse paths to a subject item, for instance, people in a matrilineal line descended from Elizabeth II (Q9682), by adding the parameter gas:traversalDirection "Reverse".

# Female lines descended from Elizabeth II (Q9682)
SELECT ?child ?childLabel ?mother ?motherLabel ?depth WHERE {
  SERVICE gas:service {
    gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.BFS" ;
                gas:in wd:Q9682 ;
                gas:linkType wdt:P25 ;
                gas:traversalDirection "Reverse" ;
                gas:out ?child ;
                gas:out1 ?depth ;
                gas:out2 ?mother .
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
} ORDER BY ?depth
Try it!

It is also possible to follow both directions, from and to the item, using gas:traversalDirection "Undirected", although, as in the example below, this muddles the parent and child columns.

# Female line leading to, and descended from, Elizabeth II (Q9682)
SELECT ?child ?childLabel ?mother ?motherLabel ?depth WHERE {
  SERVICE gas:service {
    gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.BFS" ;
                gas:in wd:Q9682 ;
                gas:linkType wdt:P25 ;
                gas:traversalDirection "Undirected" ;
                gas:out ?child ;
                gas:out1 ?depth ;
                gas:out2 ?mother .
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
} ORDER BY ?depth
Try it!

The documentation discusses some additional parameters and GAS algorithms.

Automatic optimisation -- background overview

The basic approach to query optimisation is to try to get the solution set to be as small as possible as soon as possible in the query execution, to make the work needed for subsequent joins as small as it can be. A subsidiary goal can be to maximise the possibility of pipelining and parallelisation. This is what Blazegraph's built-in query optimiser tries to achieve.

For example, here is a query to return a list of U.S. Presidents and their spouses:

SELECT ?pres ?presLabel ?spouse ?spouseLabel WHERE {
   ?pres wdt:P31 wd:Q5 .
   ?pres wdt:P39 wd:Q11696 .
   ?pres wdt:P26 ?spouse .
   SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
   }
 }
Try it!

The details of how Blazegraph has executed a particular query (and why) can be obtained via the query engine's EXPLAIN mode. This can be done by replacing the following at the start of the query-editor URL,

https://query.wikidata.org/#

with

https://query.wikidata.org/bigdata/namespace/wdq/sparql?explain&query=

to send the query to the SPARQL endpoint directly (not via the query editor), with an extra explain& in the URL before the query= string.

The resulting URL, sending the query above to the endpoint directly, with the 'explain' option, produces this report on the query and its optimisation.

By default, a query would execute from top to bottom, with solution-set joins made in the order given. This is shown as the "Original AST" plan in the report. However, the query engine estimates (as of September 2015) that the three triple patterns have vastly different cardinalities: there are far more humans, and far more spouse statements, than items that have held the position of President of the United States.

It therefore concludes that the most efficient way to run the query, rather than the order specified, is instead to first find the presidents (76), then see how many of them have spouses (51 solutions), then see how many are human (47 solutions), before sending this solution set to the labelling service. This rearrangement is typical of successful query optimisation.

A query that has difficulties

The following query attempts to find all statements referenced to the Le Figaro website, and return their subject, predicate and object. However as written it times out.

SELECT ?statement ?subject ?subjectLabel ?property ?propertyLabel ?object ?objectLabel ?refURL WHERE {
  ?statement prov:wasDerivedFrom ?ref .
  ?ref pr:P854 ?refURL 
  FILTER (CONTAINS(str(?refURL),'lefigaro.fr')) .       
  ?subject ?p ?statement .
  ?property wikibase:claim ?p .
  ?property wikibase:statementProperty ?ps .
  ?statement ?ps ?object
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}
Try it!

The reason for this can be investigated by looking at the explain report for the query.

It appears that the query optimiser is seduced by the low cardinality of the patterns ?property wikibase:claim ?p and ?property wikibase:statementProperty ?ps (there are only 1756 properties that take items as their objects). It does not anticipate that (i) the second condition will do nothing to reduce the solution set from the first condition; and (ii) these conditions will do very little to reduce the cardinality of the pattern ?subject ?p ?statement (764,223,907), which the optimiser proposes to join before looking at ?statement prov:wasDerivedFrom ?ref and then ?ref pr:P854 ?refURL.

The query optimiser may also be trying to defer testing the statement FILTER (CONTAINS(str(?refURL),'lefigaro.fr')) for as long as possible, as this requires materialisation of actual string values, rather than proceeding solely with joins based on whether items exist in statements or not.

Update: as of 2020, there are more than 47,000,000 URL references in Wikidata. As Blazegraph in its current configuration does not perform full-text indexing of text values, there is no way to create a quick "native" query without adding extra filters (e.g. "find all statements referenced to Le Figaro, but only for presidents"). Luckily, there is an out-of-the-box solution, which involves MWAPI. Note that this query is slightly different from the original request: it returns links for domains matching *.lefigaro.fr, so it won't capture links like http://web.archive.org/web/20200904002919/http://www.lefigaro.fr/. The query looks like this:

SELECT ?statement ?subject ?subjectLabel ?property ?propertyLabel ?object ?objectLabel ?refURL WITH
{
  SELECT DISTINCT ?subject WHERE {
    {
      SERVICE wikibase:mwapi {
        bd:serviceParam wikibase:endpoint "www.wikidata.org"; wikibase:api "Generator"; mwapi:generator "exturlusage"; mwapi:geulimit "500"; mwapi:geuquery "*.lefigaro.fr"; mwapi:geuprotocol "http"; mwapi:geunamespace "0" .
        ?title wikibase:apiOutput mwapi:title.
      }
    } UNION {
      SERVICE wikibase:mwapi {
        bd:serviceParam wikibase:endpoint "www.wikidata.org"; wikibase:api "Generator"; mwapi:generator "exturlusage"; mwapi:geulimit "500"; mwapi:geuquery "*.lefigaro.fr"; mwapi:geuprotocol "https"; mwapi:geunamespace "0" .
        ?title wikibase:apiOutput mwapi:title.
      }
    }
    BIND(IRI(CONCAT(STR(wd:), ?title)) AS ?subject)
  }
} AS %items {
  INCLUDE %items .
  
  hint:Query hint:optimizer "None".
  
  ?subject ?p ?statement .
  ?statement prov:wasDerivedFrom/pr:P854 ?refURL .
  FILTER (CONTAINS(str(?refURL), 'lefigaro.fr')) .
  
  ?property wikibase:claim ?p .
  ?property wikibase:statementProperty ?ps .
  ?statement ?ps ?object .
  
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
Try it!

Other optimisation modes

Blazegraph offers a handful of other optimisation directives, including the more expensive "runtime optimisation", which uses sampling to estimate more precisely how the size of the solution set is likely to develop under a particular sequence of joins; and a directive to run a particular join first.
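The runtime optimiser can be requested with a query hint; a minimal sketch, using the earlier presidents query as the body (whether it helps depends on the query):

SELECT ?pres ?spouse WHERE {
  hint:Query hint:optimizer "Runtime".  # sample-based runtime optimisation instead of static
  ?pres wdt:P39 wd:Q11696.
  ?pres wdt:P26 ?spouse.
}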

However, examining the corresponding query report, it appears that runtime optimisation has proceeded with the same query sequence as the static optimisation, and similarly times out.

As for the "runFirst" directive, I have not yet been able to work out how to apply it.

Other queries the optimiser has difficulty with

Performance falls off a cliff when the number in a group of interest exceeds a certain threshold

This arose in connection with a query to find humans with apparently matching dates of birth and death, for humans from a particular country:

  • France (N=125,223): tinyurl.com/obxk9av -- query report (successful)
  • Germany (N=194,098): tinyurl.com/pwthc9d -- query report (times out)

What goes wrong here occurs when the number of items with the given value of country of citizenship (P27), e.g. 194,098 for Germany, exceeds the number of items (time-value nodes, in fact) with wikibase:timePrecision "11"^^xsd:integer (169,902).

For France, with N=125,223, there is a steady fall-off in the size of solution set: requiring a date of death, then day-specific precision, then a date of birth, then day-specific precision, knocks this number down to 48,537, which is then reduced to 10 by requiring the date of birth and the date of death to match (a very quick operation).

However for Germany, with N=194,098, the query optimiser attempts to start with the 169,902 date-nodes with day-specific dates; but the query then explodes when the system tries to match these to all the people with such day-specific birthdates (1,560,806), a number which reduces when the nationality filter kicks in; but which still eats up the available query time.

(In fact, having looked at the performance data and seen just how fast Blazegraph can do all-against-all matching for really very big sets, it turns out that (with a query rewrite) the system is fast enough to run this kind of query for everything -- see tinyurl.com/pn9xyye -- by matching on date-nodes first, which knocks down the size of the solution set, and only checking after that the requirement that the date be day-specific.)

Adding a names and name-labels wrapper to a query for life events makes it time out

Adding an apparently trivial wrapper, to find the name and name-label, causes the query in the immediately previous section to time out, even though without the wrapper it ran with no difficulty.

  • Query as presented: tinyurl.com/p8wzb74 -- query report
  • Hand-ordered query: tinyurl.com/o7z95vb -- query report (successful)
  • Query without the name lookup: tinyurl.com/obxk9av -- query report (successful)

With the outer layer in place, the inner layer now starts (even for France) by extracting the birth-nodes that are connected to birth-dates, and then keeping those which have precision=11.

The situation can be fixed by turning off the query optimiser and hand-ordering the query.

It seems this may be a general feature of queries with sub-select clauses: both here and in the next query, the optimiser seems to want to start the sub-query with a requirement that includes one of the variables that will be projected to the outer layer.

Adding a property-labels wrapper to a query for properties makes it time out

  • Query as presented: tinyurl.com/oex59qr -- query report
  • Hand-ordered query: tinyurl.com/ppkmb3v -- query report (successful)
  • Query without property-label lookup: tinyurl.com/qe4ukbx -- query report (successful)

Same as the above: for the query including property look-ups, the optimiser apparently wants to start with all triples ?a ?p ?b and then join ?b wdt:P31 wd:Q4167410, rather than the other way round (which is what it manages to successfully do for the smaller query with no property look-up).

This doesn't appear to be an artifact caused by BIND, nor by the order the clauses are presented in -- even if these are changed, the query still runs slowly with the built-in optimiser:

  • Alt query: tinyurl.com/q97a8sh -- query report (still fails)

It is still preferring to start with the assertion ?a ?p ?b, even though this is less specific -- apparently because ?p is one of the variables projected out of the inner SELECT statement.

Queries with FILTER NOT EXISTS

Queries with “filter not exists” often seem to need rewriting to work well.

The “filter not exists” clause often seems to be inefficient, for unclear reasons. For example, a query like this one, for featured articles on frwiki that do not exist at all on enwiki, times out when expressed with “filter not exists”:

select ?articleFr ?item ?itemLabel ?some ?someLabel {
    select * {?articleFr schema:about ?item ;
             wikibase:badge ?some ;
             schema:isPartOf <https://fr.wikipedia.org/>
  
    filter not exists {
      ?articleEn schema:about ?item ;
               schema:isPartOf <https://en.wikipedia.org/>
    }
   }
}
Try it!

however it works fine when simply rewritten using an “optional”, filtering on an unbound variable:

select ?articleFr ?item ?itemLabel ?some ?someLabel {
  {
    select * {?articleFr schema:about ?item ;
             wikibase:badge ?some ;
             schema:isPartOf <https://fr.wikipedia.org/>
  
      optional { ?articleEn schema:about ?item ;
               schema:isPartOf <https://en.wikipedia.org/> .
      }
      filter (!bound(?articleEn))
   }
  }
}
Try it!


Using MINUS also works well in this instance. These three methods of removing results all do something similar but go about it differently. No single method is consistently better or worse; they all perform differently depending on the circumstances. In other words, you ought to try them all if you are serious about performance. A MINUS version of the query above is sketched below.
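For reference, a sketch of the same query rewritten with MINUS:

select ?articleFr ?item ?some {
  ?articleFr schema:about ?item ;
             wikibase:badge ?some ;
             schema:isPartOf <https://fr.wikipedia.org/> .
  minus {  # removes items that also have an enwiki article
    ?articleEn schema:about ?item ;
               schema:isPartOf <https://en.wikipedia.org/> .
  }
}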

Rules of Thumb

Here's my explanation of the situation:

  • FILTER NOT EXISTS needs to check a condition on each solution of the subquery. These conditions are checked in sequence, not by join.
  • OPTIONAL fetches an extra triple pattern related to each solution. This is done by join, i.e. all at once. The FILTER (!BOUND(...)) can then quickly check each row of this extended result set.
  • MINUS computes the second subquery, then finds the set difference of the first and second subqueries. If both are relatively small, the difference can be computed quickly.

So a rule of thumb is:

  • Use FILTER NOT EXISTS when the negative condition to check produces a huge resultset, so the NOT EXISTS may be less expensive
  • Use OPTIONAL... FILTER NOT BOUND when the negative condition extends the positive resultset with only a few more rows (or keeps the same rows)
  • Use MINUS when the negative condition, taken on its own, produces a small resultset

Dealing with cached queries

This post may be useful for folks who want to make sure queries are freshly executed:

In short: to disable query caching, add a cache-control: no-cache header. Alternatively, you can simply replace the GET method with POST; results of POST requests are never cached. Conversely, if you want a cached response, switch from POST to GET.

More... ?

Add further cases here