Wikidata:SPARQL query service/WDQS graph split/Federation Limits

SPARQL federation[1] allows combining graphs hosted on different services. Such queries have inherent limitations because some information has to be exchanged between the two services, and knowing what happens when a federated query runs will help to better understand those limitations.

TL;DR: when intermediate results are transmitted from the federated service back to the host service, you can expect 1,000,000 Wikidata URIs to take around 10 seconds. Transmitting from the host service to the federated service is much slower: joining 100,000 URIs that way might take 40 seconds. Read on to understand why.

How federation works under the hood

The first step that has to happen is to determine what the shared variables are.

SELECT ?author ?authorLabel ?publicationDate {
  ?author rdfs:label ?authorLabel .
  FILTER (LANG(?authorLabel) = 'en')

  SERVICE <https://query-scholarly-experimental.wikidata.org/sparql> {
    wd:Q74426266 wdt:P50 ?author ;
                 wdt:P577 ?publicationDate .
  }
}

In the above example ?author is used in a triple pattern[2] on both the host and the federated service, while ?publicationDate is used only in the SELECT clause of the host query and in a triple pattern of the federated query. This means that:

  • ?author: can be bound[3] from either the host or the federated service
  • ?publicationDate: can be bound only from the federated service

The difficulty is that SPARQL does not provide a way to specify explicitly which part should bind such a variable first. Should ?author be bound first on the host using ?author rdfs:label ?authorLabel, or should the query go to the federated endpoint first and bind it with wd:Q74426266 wdt:P50 ?author? A trained eye will quickly see that running the former (?author rdfs:label ?authorLabel) first is going to load far too many triples from the backend. Because the choice is left to the triple store, the query may well time out if the wrong part runs first and there are too many results to transfer.

If the federated query is run first it would look like this:

SELECT ?author ?publicationDate WHERE {
    wd:Q74426266 wdt:P50 ?author ;
                 wdt:P577 ?publicationDate .
}

Blazegraph will simply wrap the BGP[4] in a SELECT that explicitly selects the shared variables. The returned results are then used to bind the variables in the host query.

If, on the other hand, the federated query is run after the host patterns, it will be rewritten and executed on the federated server as:

SELECT ?author ?publicationDate WHERE {
    wd:Q74426266 wdt:P50 ?author ;
                 wdt:P577 ?publicationDate .
}
VALUES (?author) {
( wd:Q1 )
( wd:Q2 )
( wd:Q3 )
...
}

The ?author bindings are provided via a dedicated VALUES section. Note that the ?publicationDate bindings are not provided since it is not part of a triple pattern of the top query.

The bindings are sent in chunks of 100 values per call. The chunk size can be changed with a query hint, for example hint:Query hint:chunkSize 1000.
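
As a sketch of where the hint goes, the query below sets the chunk size on a query in which the host service binds ?subject and pushes those bindings to the federated service. The value 1000 is only an illustration, and whether the host pattern actually runs first still depends on the optimizer.

```sparql
SELECT ?paper ?subject {
  # Send bindings in chunks of 1000 instead of the default 100.
  hint:Query hint:chunkSize 1000 .

  # Host side: a restricted pattern that binds ?subject.
  ?subject wdt:P279+ wd:Q18362 .

  # Federated side: receives the ?subject bindings through VALUES,
  # one chunk per call.
  SERVICE <https://query-scholarly-experimental.wikidata.org/sparql> {
    ?paper wdt:P921 ?subject .
  }
}
```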

Controlling the order of operations

Forcing Blazegraph to run the federated query before or after some part of the main query is key to controlling how many results are transferred.

Disabling the optimizer

With the optimizer disabled (hint:Query hint:optimizer "None"), Blazegraph should join the triple patterns in the order in which they are written.

Our example query can be re-organized:

SELECT ?author ?authorLabel ?publicationDate {
  hint:Query hint:optimizer "None" .
  SERVICE <https://query-scholarly-experimental.wikidata.org/sparql> {
    wd:Q74426266 wdt:P50 ?author ;
                 wdt:P577 ?publicationDate .
  }

  ?author rdfs:label ?authorLabel .
  FILTER (LANG(?authorLabel) = 'en')
}

This way the federated query will run first and ?author can be bound from the pattern wd:Q74426266 wdt:P50 ?author.

Forcing to run first or last

Another hint is hint:Prior hint:runLast true (or hint:runFirst), which instructs Blazegraph to run the previous pattern last (or first).

Our example query can be re-organized:

SELECT ?author ?authorLabel ?publicationDate {
  ?author rdfs:label ?authorLabel .
  hint:Prior hint:runLast true .
  FILTER (LANG(?authorLabel) = 'en')

  SERVICE <https://query-scholarly-experimental.wikidata.org/sparql> {
    wd:Q74426266 wdt:P50 ?author ;
                 wdt:P577 ?publicationDate .
  }
}

This way the broad pattern ?author rdfs:label ?authorLabel is forced to run last (once the ?author is actually bound thanks to the federated query).
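
The hint:runFirst variant can be sketched the same way, by marking the federated part rather than the broad pattern. This assumes that Blazegraph applies hint:Prior to a SERVICE clause when the hint directly follows it, which is worth checking before relying on it.

```sparql
SELECT ?author ?authorLabel ?publicationDate {
  ?author rdfs:label ?authorLabel .
  FILTER (LANG(?authorLabel) = 'en')

  SERVICE <https://query-scholarly-experimental.wikidata.org/sparql> {
    wd:Q74426266 wdt:P50 ?author ;
                 wdt:P577 ?publicationDate .
  }
  # Assumption: hint:Prior targets the SERVICE clause just above it.
  hint:Prior hint:runFirst true .
}
```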

Using Blazegraph Named Queries

Blazegraph named queries tend to be run first, so they are another option for controlling the order of operations.

SELECT ?author ?authorLabel ?publicationDate
WITH {
  SELECT * {
    SERVICE <https://query-scholarly-experimental.wikidata.org/sparql> {
       ?paper wdt:P50 ?author ;
              wdt:P577 ?publicationDate .
    }
  }
} AS %fetchauthor WHERE {
  VALUES (?paper) {(wd:Q74426266)}
  include %fetchauthor
  ?author rdfs:label ?authorLabel .
  FILTER (LANG(?authorLabel) = 'en')
}
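
In the query above the VALUES clause sits outside the named query. If the named query is evaluated on its own before the rest of the query (an assumption about Blazegraph's behavior), ?paper would still be unbound when the federated call is made. A variant that keeps the restriction inside the SERVICE clause, so that the federated query is restricted regardless of evaluation order, might look like this:

```sparql
SELECT ?author ?authorLabel ?publicationDate
WITH {
  SELECT * {
    SERVICE <https://query-scholarly-experimental.wikidata.org/sparql> {
      # The restriction travels with the federated query itself.
      VALUES (?paper) {(wd:Q74426266)}
      ?paper wdt:P50 ?author ;
             wdt:P577 ?publicationDate .
    }
  }
} AS %fetchauthor WHERE {
  INCLUDE %fetchauthor
  ?author rdfs:label ?authorLabel .
  FILTER (LANG(?authorLabel) = 'en')
}
```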

Limits

There are no hard limits on how many results are pushed from the host service to the federated service (using explicit VALUES bindings) or pulled from it by reading the result set. The only limiting factor is the time allowed for the query to complete (currently 1 minute).

Below is a table of times (in seconds) it takes to transfer some results from the federated service to the host service:

Number of results   1 Id[5]   2 Ids[6]   Label[7]
10,000              0.5       0.5        0.5
100,000             1.5       1.8        2.0
1,000,000           10        15         22
2,500,000           22        40         timeout
5,000,000           50        timeout    timeout

The times are highly dependent on the size of the information transferred.

The time needed for the host service to transfer bindings to the federated service depends mainly on the cost of the federated query itself. Various chunk sizes can be tested, but 10,000 should be considered a very high value; anything above that is unlikely to yield better performance. Testing a naive join in this direction (the host service sending VALUES bindings to the federated service), we can expect at most 100,000 values to be joined in 45 seconds[8].
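
A back-of-the-envelope calculation, using only the figures quoted above, shows why larger chunks reduce the number of calls and how the two directions compare. The throughput values are rough averages that ignore per-call overhead and the cost of the federated query itself.

```latex
\begin{align*}
\text{calls} &= \left\lceil \frac{N}{\text{chunkSize}} \right\rceil \\
N = 100{,}000,\ \text{chunkSize} = 100 &\;\Rightarrow\; 1{,}000 \text{ calls} \\
N = 100{,}000,\ \text{chunkSize} = 10{,}000 &\;\Rightarrow\; 10 \text{ calls} \\[4pt]
\text{federated} \to \text{host (1 Id)} &:\ \frac{1{,}000{,}000}{10\ \text{s}} = 100{,}000 \text{ results/s} \\
\text{host} \to \text{federated} &:\ \frac{100{,}000}{45\ \text{s}} \approx 2{,}200 \text{ values/s}
\end{align*}
```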

Common mistakes

Wrapping a federated query with a select

If for some reason you need to wrap the federated query in a SELECT, beware that Blazegraph will wrap the query again to select the shared variables. If your SELECT does not project one of the shared variables, the query may behave unexpectedly.

For example:

SELECT (SUM(?count) AS ?total) {
  ?subject wdt:P279+ wd:Q18362 .
  SERVICE <https://query-scholarly-experimental.wikidata.org/sparql> {
    SELECT (COUNT(*) AS ?count) {
      ?paper wdt:P921 ?subject .
    }
  }
}

Here we naively try to count the number of papers whose main subject (P921) is a subject related to theoretical physics (Q18362). The shared variable is ?subject, but since we do not select it we break the link: Blazegraph no longer treats it as the same variable and counts all paper - main subject (P921) pairs. So if aggregations are required in the federated query, the shared variable must always be selected.

The way to fix the query above is simply to include ?subject using a GROUP BY:

SELECT (SUM(?count) AS ?total) {
  ?subject wdt:P279+ wd:Q18362 .
  SERVICE <https://query-scholarly-experimental.wikidata.org/sparql> {
    SELECT ?subject (COUNT(*) AS ?count) {
      ?paper wdt:P921 ?subject .
    } GROUP BY ?subject
  }
}

Returning variables bound by OPTIONAL

You might sometimes want to use an OPTIONAL clause in the federated query. If any of its variables are shared variables extra care must be taken, especially if a variable is also used in a triple pattern of the host query.

SELECT ?paper ?venueLabel {
  ?subject wdt:P279+ wd:Q18362 .
  SERVICE <https://query-scholarly-experimental.wikidata.org/sparql> {
    ?paper wdt:P921 ?subject .
    OPTIONAL { ?paper wdt:P1433 ?venue }
  }
  OPTIONAL {
    ?venue rdfs:label ?venueLabel .
    FILTER (LANG(?venueLabel) = 'en')
  }
}

The above query extracts all articles whose main subject (P921) is related to theoretical physics (Q18362) and returns an optional binding ?venue, which is then used to optionally fetch its label. The issue is that ?venue may not always be bound (it is optional, after all), in which case the triple pattern ?venue rdfs:label ?venueLabel will attempt to bind it. Unrestricted, this pattern is particularly broad and is likely to return far too many triples.

There is an ugly workaround to this problem if you need to return possibly unbound bindings from the federated query: always bind the variable, falling back to a sigil (Q1758446) with the COALESCE and BOUND functions:

SELECT ?paper ?venueLabel {
  ?subject wdt:P279+ wd:Q18362 .
  SERVICE <https://query-scholarly-experimental.wikidata.org/sparql> {
    SELECT ?paper (COALESCE(IF(BOUND(?venue), ?venue, '__no_venue__')) AS ?venue) ?subject {
      ?paper wdt:P921 ?subject .
      OPTIONAL { ?paper wdt:P1433 ?venue }
    }
  }
  OPTIONAL {
    ?venue rdfs:label ?venueLabel .
    FILTER (LANG(?venueLabel) = 'en')
  }
}

Note the __no_venue__ sigil used when ?venue is not bound: it ensures that the subsequent ?venue rdfs:label ?venueLabel triple pattern can never match.
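
Since COALESCE already skips unbound arguments, the IF(BOUND(...)) wrapper is not strictly required. The sketch below also projects the result under a new name, ?venueOrSigil (a hypothetical variable name), because standard SPARQL does not allow AS to reuse a variable that is already in scope. Blazegraph may be more lenient here, so treat this as an alternative rather than a correction.

```sparql
SELECT ?paper ?venueLabel {
  ?subject wdt:P279+ wd:Q18362 .
  SERVICE <https://query-scholarly-experimental.wikidata.org/sparql> {
    # Fall back to the sigil whenever ?venue is unbound.
    SELECT ?paper ?subject (COALESCE(?venue, '__no_venue__') AS ?venueOrSigil) {
      ?paper wdt:P921 ?subject .
      OPTIONAL { ?paper wdt:P1433 ?venue }
    }
  }
  OPTIONAL {
    # The sigil is a plain literal, so it never matches as a subject.
    ?venueOrSigil rdfs:label ?venueLabel .
    FILTER (LANG(?venueLabel) = 'en')
  }
}
```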

Misplacing the label service

The label service must be used in the query running on the service that holds the label of the entity.

For instance fetching the label on the host service does not work if the entity is coming from the federated service:

SELECT ?paper ?paperLabel ?author ?authorLabel ?publicationDate {
  VALUES (?paper) {(wd:Q74426266)}
  SERVICE <https://query-scholarly-experimental.wikidata.org/sparql> {
    ?paper wdt:P50 ?author ;
           wdt:P577 ?publicationDate .
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

In the query above we fetch some information (author and publication date) about the item Q74426266. The label service is present only in the top-level query, yet the labels of both the author and the paper are requested. Because the entity behind ?paper is hosted on the federated service, its label cannot be loaded from the host service.

The solution is to fetch the label from the federated service. To avoid having to wrap our BGP in a SELECT, we can use BIND to tell the label service that we are interested in this label:

SELECT ?paper ?paperLabel ?author ?authorLabel ?publicationDate {
  VALUES (?paper) {(wd:Q74426266)}
  SERVICE <https://query-scholarly-experimental.wikidata.org/sparql> {
    ?paper wdt:P50 ?author ;
           wdt:P577 ?publicationDate .
    BIND(?paperLabel AS ?paperLabel)
    SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

Notes
