Wikidata:SPARQL query service/suggestions

A page for suggestions, and for reports of strange behaviour

See also: All open tickets for WDQS

Meta


Feedback

Action: this page now created
  • It would probably be good to set up a permanent suggestions page on this wiki -- much better than a Google form, because then we can read and think about what everybody else is saying. Jheald (talk) 10:09, 13 September 2015 (UTC)

Queries

Action: Wikidata:SPARQL_query_service/queries now created

Wishlists


Sourcing analysis

  • Querying sources/references is the biggest benefit of WQS, imho. Including strings/reference URL (P:P854) (Wikidata does have Special:LinkSearch, but...) If I want to consider using Wikidata property values in Wikipedia infoboxes or lists, I want to know: Does the data have references/sources? Are 99% of those references the dreaded "imported from wikipedia" type? How many have "real" references and what are they?

--Atlasowa (talk) 08:32, 9 September 2015 (UTC)

Output


PagePile

https://tools.wmflabs.org/pagepile/ Good one, thanks. --Atlasowa (talk) 08:39, 9 September 2015 (UTC)
  Done Pagepile's "create a new pile" page now includes an input box into which a SPARQL query can be cut-and-pasted. Magnus's tweet announcing it. Jheald (talk) 10:18, 20 September 2015 (UTC)

URL shortener


URL shortener when queries time out


Currently the URL shortener button is not presented if a query times out.

One can generate a short URL by cutting-and-pasting from the browser address bar to a URL shortener, but it would be nice if the button were there, in the same way that it is there for successful queries. Jheald (talk) 21:58, 29 September 2015 (UTC)

  Done --Smalyshev (WMF) (talk) 23:17, 7 March 2016 (UTC)

Font choice and size


Formatting

  • One change I would like to see is in the output formatting of the results. It would be nice if I could simply copy-and-paste results into a text editor, and have them in nice tab-separated columns. Instead there's an odd mixture with carriage returns and additional stars that then needs clean-up with a text editor.
  Done This has been done; there is now a download option box, for results to be downloaded in CSV, TSV or JSON. Jheald (talk) 10:20, 20 September 2015 (UTC)
Another nice addition to the output would be Reasonator links, for items and dates. Jheald (talk) 10:09, 13 September 2015 (UTC)
+ links opening by default in a new tab. --- Jura 10:23, 13 September 2015 (UTC)

Query interface

  • develop a more user-friendly interface for making queries. I am now able to modify the queries that were given to me, to change a year or language link, but I don't understand how it queries references, or how to modify the result display. :/ --Hsarrazin (talk) 19:25, 9 September 2015 (UTC)
For example see "Boolean Graphical Search Learning Tool" Boolify (mobile Boolify, How-to video). --Atlasowa (talk) 09:42, 2 February 2016 (UTC)

Graphical output


http://demo.seco.tkk.fi/visu/ can produce some rather nice graphical output for a SPARQL query -- including frequency charts, and Google geographical plots. It would be nice to have similar output; or alternatively a button to send a query to run on http://demo.seco.tkk.fi/visu/, which knows how to submit queries to the Wikidata WDQS server, and interpret the results. Jheald (talk) 22:02, 29 September 2015 (UTC) (+github link) (+example query of DBpedia)

Link to the 'explain' report

For trying to understand query performance and the query optimisation path taken, it's useful to be able to look at the 'explain' report for the query. (See Wikidata:SPARQL_query_service/query optimization).

This can be accessed by replacing the following at the start of the query-editor URL,

https://query.wikidata.org/#

with

https://query.wikidata.org/bigdata/namespace/wdq/sparql?explain&query=

-- which is easy enough, if this page is available to cut and paste from. But it would be even easier if there were a button or link to take users to this report. Jheald (talk) 10:50, 2 October 2015 (UTC)

  Support -- it should be in the "timeout" message directly. d1g (talk) 07:40, 16 May 2017 (UTC)
  Support --MarioGom (talk) 13:14, 5 March 2020 (UTC)

Query limits


It would be good, as a community, to try to map out the limits of what is currently possible with the query service -- in particular, for example, how far it is possible to broaden narrow queries before they time out.

In some cases, there may then be ways to re-write the queries, or change their query optimization to allow broader queries to be run, pushing the limits further. (In particular, if query performance suddenly "falls off a cliff" as the query is run for broader groups, as per the example presented on the optimization page, this can be a sign that the built-in optimization strategy has changed, which can sometimes be overcome by specifying a hand-ordering for the query).

In other cases, searching for the limits may reveal strange performance issues, such as some of those listed immediately below.

But it would be good to get an idea, generally, of what is and what is not possible, with the current set-up.

Performance surprises


"It would be great for the community to catalogue the queries that are most important for their use cases that do not do well on the SPARQL endpoint."

-- Mailing list comment from Benjamin Good, [1]

As a preliminary point, I think the same basics as with most other query engines (SQL, etc.) apply:

- non-restrictive queries with tons of results will be slow - i.e. "list of all humans" is probably not a good question to ask :)

- negative searches are usually slower - e.g. "all humans without images" will be slow, since that query would have to inspect records for every human

- unbound paths/traversals will usually be slower (unfortunately, many queries that use TREE in WDQ fall into this category), especially if there are a lot of starting points for traversals (again, "all humans that", etc...)

It is also a good idea to put a LIMIT on queries when experimenting: if you intended to write a query that asks for 10 records but accidentally wrote one that returns 10 million, it's much nicer to discover that with a suitable limit than to wait for the query to time out and then try to figure out why it happened.

Yes, I realize all this has to go to some page in the manual eventually :)
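A minimal sketch of these points, assuming the standard WDQS prefixes (the property and item choices here are only illustrations):

```sparql
# A negative search: all humans (Q5) without an image (P18).
# This must inspect every instance of human, so it is slow --
# keep a LIMIT on while experimenting.
SELECT ?human WHERE {
  ?human wdt:P31 wd:Q5 .                        # instance of: human
  FILTER NOT EXISTS { ?human wdt:P18 ?image }   # no image statement
}
LIMIT 100
```

Removing the LIMIT turns this into exactly the kind of broad negative query described above.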

Query optimizer issues


Blazegraph has a built-in query optimizer which attempts to find the most efficient way to execute a given query from amongst the different possible execution plans.

This often works well, but sometimes it can make the wrong choices. In such cases it may be possible to do better by ordering a query by hand. Valuable information on the optimisation and execution performance of a query can often be obtained from the query engine's 'explain' mode, as described on the query optimization page.

Low initial cardinality can lead to false trails


A basic approach in query optimisation is to try to get the solution set to be as small as possible as soon as possible in the query execution, to make the work needed for subsequent joins as small as it can be.

The query optimiser therefore seems typically to re-order a query to start with the most specific assertion first, ie the one with the smallest solution set, and then follow the trail from there. Very often this is exactly the right strategy. However in some cases subsequent joins then greatly multiply the size of the solution set (rather than tending to reduce it), and starting from a different initial assertion may be better. The query optimization page gives an example.
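Where hand-ordering is needed, Blazegraph's query-hints vocabulary can switch the optimizer off, so that clauses execute in the order written. A sketch (the triple patterns are illustrative; wd:Q5582 is taken here to be Vincent van Gogh):

```sparql
PREFIX hint: <http://www.bigdata.com/queryHints#>

SELECT ?painting WHERE {
  hint:Query hint:optimizer "None" .   # execute clauses in written order
  ?painting wdt:P170 wd:Q5582 .        # creator: van Gogh -- small set first
  ?painting wdt:P31 wd:Q3305213 .      # then confirm instance of: painting
}
```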

Queries involving sub-queries


Where queries contain sub-queries (ie SELECT clauses inside SELECT clauses), it appears that by default the query optimizer will try to start the sub-query from an assertion that contains one of the variables that will be projected from the inner query to the outer query.

This appears to be the case even if there are other assertions in the sub-query that might have much lower cardinality; and even if the optimiser might optimise the sub-query quite differently if it were standing on its own.
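In schematic form (the property and item choices are illustrative), the pattern looks like this; the point reported above is that the optimizer anchors the inner query on a clause containing the projected variable, here ?painting, even when another inner clause has far lower cardinality:

```sparql
SELECT ?painting ?location WHERE {
  {
    # Inner query: projects only ?painting to the outer query.
    SELECT ?painting WHERE {
      ?painting wdt:P31 wd:Q3305213 .   # instance of: painting (large set)
      ?painting wdt:P170 wd:Q5582 .     # creator: van Gogh (much smaller set)
    }
  }
  ?painting wdt:P276 ?location .        # location of each painting found
}
```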

Query optimiser defeated by an optional BIND statement


This query tinyurl.com/nr5e7ch runs in 368 ms.

However, if one tries to record which branch of the first UNION has been taken, tinyurl.com/ntenxgw, then the query times out. Jheald (talk) 16:47, 26 September 2015 (UTC)

Update Query fixed, by changing the order of clauses in the WHERE block: tinyurl.com/puuw4hv. It's not clear why the query optimiser wasn't able to do this itself when the BIND statements were there -- but it might be because BIND statements can sometimes change the results of subsequent clauses. Jheald (talk) 10:48, 29 September 2015 (UTC)

There was a previous issue flagged to Blazegraph about optimising around UNIONs (BLZG-1297), in the absence of any BIND statements. That appears now to have been fixed (or at least query tinyurl.com/ob527ow now appears to run without any difficulty, though the bug report at Blazegraph doesn't appear to have been formally closed). But it seems the fix does not hold if BIND statements are present. Jheald (talk) 15:49, 2 October 2015 (UTC)

Path assertions that end in a variable can blow up


Such path searches fail even if the variable is a constant that has been set with a BIND statement


This query runs in about 0.3 seconds:

SELECT (COUNT(?item) AS ?count) WHERE {
   ?item wdt:P31/wdt:P279* wd:Q3305213 . # paintings
}
Try it!

However, if BIND is used, the equivalent query times out:

SELECT (COUNT(DISTINCT ?a) AS ?count) WHERE {
    BIND (wd:Q3305213 AS ?class) .
    ?a wdt:P31/wdt:P279* ?class .
}
Try it!

Other anomalies


In other cases too, small changes in the detailed specification of path searches can make considerable differences -- compare the before and after versions in this diff, which on the face of it should be entirely equivalent.

General performance of path queries


Even in very simple graphs, the path query takes longer:

The following query asks for all male first names 'said to be the same as' Charles:

  • tinyurl.com/qeua4h4

Allowing 3 hops, it finds 14 names (the same as with 2 hops) in 0.289 seconds.

However, using the path search instead takes 0.507 seconds -- almost twice as long.

  • tinyurl.com/pmcrqn2

Can anything be done to make these path queries quicker -- especially when combining wdt:P31/wdt:P279* results with other conditions? It is very disappointing, especially when considering (a) how good Magnus's WDQ is at queries of this type; and (b) how pervasive an idea it is to "refine subclasses" to group items together more precisely, making searches involving wdt:P31/wdt:P279* essential to retrieve the full umbrella set.

As reported earlier at Project Chat, even simple counts can be far slower in the SPARQL service than with WDQ. For example, here's a query to count items in the sub-class tree of occurrence (Q1190554), using WDQ and using the new WQS beta. The WQS is over 30 times slower, taking 2164 ms, compared to 63 ms for WDQ. Jheald (talk) 18:37, 24 September 2015 (UTC)

OPTIONAL clause makes things much slower


The following query for isotopes and their decay modes is fast: tinyurl.com/gpe57wz - 700 results in 173 ms. However, making the decay mode piece optional: tinyurl.com/hs8ld7n fails with a limit of 700, and takes 9582 ms for 500 results. Any explanations? Searching without the decay modes piece of the query at all is pretty speedy too, even with no limit on results: tinyurl.com/h4mbac9 - 4736 results in 283 ms. Something about OPTIONAL is causing trouble here, I think? ArthurPSmith (talk) 18:34, 16 February 2016 (UTC)

Note that replacing the OPTIONAL clause with a UNION makes it much faster again. ArthurPSmith (talk) 20:33, 16 February 2016 (UTC)
The first thing to do is to check the output of the 'explain' option for the query, as explained at Wikidata:SPARQL query service/query optimization, to see what sequence of commands the engine is actually executing in each case, after the query optimizer has re-arranged it.
My suspicion is that without the "optional" clause, it may be that the system is looking up the decays to (P816) statement first, which may be (a) quicker and (b) give a smaller initial solution set, than looking up wdt:P31/wdt:P279* wd:Q25276 first.
But I could be completely wrong. Jheald (talk) 23:31, 16 February 2016 (UTC)
Note that this is now fixed (not a problem at all) with the current Wikidata Query Service. ArthurPSmith (talk) 15:02, 11 March 2016 (UTC)

Example queries that don't work any more


Other strange behaviour


MINUS strangeness


In this query tinyurl.com/nqawlxp the query service appears to be ignoring the keyword MINUS.

The query is supposed to return co-ordinates for places in Paris which do not have more than one specified set of co-ordinates. Instead, the query is only returning places which have more than one set.

It's easy enough to work round (eg by making a positive clause tinyurl.com/otlzwdx), but I don't understand why my original query is being interpreted this way. Jheald (talk) 11:34, 29 September 2015 (UTC)
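One pattern worth checking in such cases (a general observation about SPARQL negation, not a confirmed diagnosis of the query above): MINUS only removes solutions whose bindings are compatible with the negated group, whereas FILTER NOT EXISTS re-evaluates its group against each outer solution, which makes it the more predictable idiom. A sketch:

```sparql
SELECT ?place ?coord WHERE {
  ?place wdt:P131 wd:Q90 .        # located in Paris
  ?place wdt:P625 ?coord .        # coordinate location
  FILTER NOT EXISTS {             # exclude places that also have a
    ?place wdt:P625 ?coord2 .     # second, different set of coordinates
    FILTER (?coord2 != ?coord)
  }
}
```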

Geographical query can throw a "ConcurrentModificationException" error


As described in the Wikidata:SPARQL_query_service/queries#Working_with_co-ordinates section, certain broad searches on co-ordinates seem to be able to throw a "ConcurrentModificationException" error. Jheald (talk) 12:43, 5 October 2015 (UTC)

See also Wikidata:SPARQL_query_service/query_limits#Search_within_box for more occurrence data.

Variables bound from GROUP_CONCAT in inner SELECT not visible to outer SELECT


A list of values gathered together by a (GROUP_CONCAT(DISTINCT(?rkd); separator = "; ") AS ?rkd_list) in an inner SELECT statement appears not to be visible to the outer SELECT. See bug report for a full example. Jheald (talk) 12:00, 9 October 2015 (UTC)
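The pattern being reported, in schematic form (the properties are illustrative; per the report, ?rkd_list comes back unbound in the outer SELECT even though the inner SELECT projects it):

```sparql
SELECT ?item ?rkd_list WHERE {
  {
    # Inner query: gather one concatenated list per item.
    SELECT ?item (GROUP_CONCAT(DISTINCT(?rkd); separator = "; ") AS ?rkd_list)
    WHERE {
      ?item wdt:P31 wd:Q3305213 .   # paintings (illustrative class)
      ?item wdt:P350 ?rkd .         # some identifier values to concatenate
    }
    GROUP BY ?item
  }
}
```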

ORDER BY doesn't work well with items and null values

SELECT ?place ?placeLabel ?event ?eventLabel ?ref ?refp ?refv
WHERE
{
    wd:Q28922083    p:P17            ?s     .
      ?s           ps:P17            ?place .
      ?s           pq:P793           ?event .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
} ORDER BY DESC(?event)
Try it!

Weird sorting (wd:Q38087, wd:Q29278, ... , wd:Q86591, ...) persists with the following modifications:

label service returns t385555272 t385555269 for missing values


Same query as in section above (ORDER BY doesn't work well) d1g (talk) 11:41, 11 March 2017 (UTC)

Unclosed link elements in output

When I access the query service using a tool I downloaded, the tool complains about unclosed link elements. Looking at the response, I do see unclosed link elements. Although many tools accept unclosed link elements, shouldn't the query service be ultra-compliant? Peter F. Patel-Schneider (talk) 03:42, 8 May 2018 (UTC)

Feature requests


Label service should not fail when given a blank to find the label for

  • eg for a field from an OPTIONAL clause, that may only be filled in sometimes.
(Example from the mailing list: https://lists.wikimedia.org/pipermail/wikidata/2015-September/007082.html).
Some comments by User:Smalyshev (WMF) on the label service: [5]
The issue can be worked around with code like
OPTIONAL { 
     ?item rdfs:label ?item_label filter (lang(?item_label) = "en"). 
}

Label service for properties in the wdt: namespace


It would be nice if the label service could also find labels for properties in the wdt: namespace.

The issue can be worked round by adding a line like

    ?prop wikibase:claim ?p

to join the corresponding entity ?prop in the wd: namespace into the query, and the label service can then generate a label ?propLabel.

However, as described on the query optimization page, the query optimizer finds ?prop wikibase:claim ?p an unfortunately seductive starting-point for queries. The statement has an initial solution set of only about 2000 entries, but starting from here the query will (usually) then blow up when the engine tries to attach the next statement, typically causing the query to time out. To avoid this requires that the optimizer is turned off and the query be explicitly hand-ordered.

Turning off the optimizer is all very well, but it increases the level of knowledge required, and there are some queries that are not easy to hand-order well without the built-in optimizer. Jheald (talk) 08:11, 16 September 2015 (UTC)
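Putting the workaround and the hand-ordering caveat together, a sketch (wd:Q5582, taken here to be Vincent van Gogh, is illustrative; wikibase:directClaim is used as the mapping for wdt:-namespace predicates, where wikibase:claim maps the p: namespace):

```sparql
PREFIX hint: <http://www.bigdata.com/queryHints#>

SELECT ?p ?propLabel WHERE {
  hint:Query hint:optimizer "None" .     # hand-ordered, as described above
  wd:Q5582 ?p ?value .                   # all predicates on one item (illustrative)
  ?prop wikibase:directClaim ?p .        # join the wd: entity for each wdt: predicate
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
```

The label service can then generate ?propLabel from the joined ?prop entity.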

Sorting sequence for entities in the wd: namespace


@Smalyshev (WMF): Is there a sorting order for entities in the SPARQL service? Could you add one?

For example, consider the following query:

tinyurl.com/nwc47gj

In the query ?fname is a firstname, like Dave (Q16275316), and ?altname is an alternate name said to be the same as (P460) as the firstname, eg David (Q18057751).

I would like to be able to canonicalise firstnames to find the firstname with the lowest Q-number, (to be called ?name) -- eg to count the number of 'families' of firstnames, or count each family's total incidence.

However, it seems that although I can find a sample alternative name (using SAMPLE), I can't reliably create a canonical one (using MIN) -- the query engine throws a Vocab(2) exception. I presume this means that no sequence order has been declared for wikibase:Item.

Is this right? and/or is there a good way round this? Jheald (talk) 14:39, 11 September 2015 (UTC)

With some help from User:Jura1 I created this tinyurl.com/n9pfqno to count the number of variants in a group of similar first-names, which works by canonicalising to a number, of which one can then find the MIN, and then in the outer loop finding the item that maps to that number.
But it still seems that it would be useful to be able to calculate the MIN of a group of items directly. Jheald (talk) 22:24, 16 September 2015 (UTC)
This is tricky. In general, there's no order on RDF entities, so MIN is not well-defined, unless we were to compare lexicographically as strings. I'll research the existing practices in this area. --Smalyshev (WMF) (talk) 19:10, 22 September 2015 (UTC)
Attaching pageview statistics to query results

One thing that would be nice, perhaps in the longer term, would be to be able to attach pageview statistics from the corresponding Wikipedia articles to the results of queries -- making it possible to give answers to questions like "What were the top 20 battles fought in the UK, by pageviews", or "top 10 U.S. presidents by pageviews", or "who was born on this day tinyurl.com/p6ry6y7 - by pageviews" or whatever. I believe Histropedia already does something like this, to decide on the most likely-to-be-interesting top n events to put on its timelines -- so the timelines show the most high-profile events at the widest zoom, then more and more "detail" events as one zooms in. It would be nice (eventually) to make this sort of popularity-based choice also available for ranking for general queries. Jheald (talk) 19:13, 24 September 2015 (UTC)

Access to other Mediawiki databases from the SPARQL service


There are other WMF databases, beyond the page-hits database, that it would be nice to be able to gateway into the SPARQL service, ie so that they could be made visible (using FROM ?) for a query if desired. (I have read that SPARQL is in fact quite often used to interrogate conventional relational databases, and that quite a lot of RDBMS <-> SPARQL bridging software exists, because the flat table-of-triples approach of SPARQL is sometimes found the easiest way to bring together information from different inherited relational databases with very different table designs).

If SPARQL could access the main mediawiki table-replicates on Labs, one might then be able to write queries that eg for a particular Q-number for a Wikimedia category could find the Q-numbers for all the parent categories across every language (or a choice of languages), and then identify whether the subcategory relationship could be "explained" by a short chain of properties connecting the corresponding category's main topic (P301) items. Of course this could be done by a tool interrogating the different databases separately, but it would be nice if it could all be done from within a single query in SPARQL.

I am sure that others could add other examples of information in the mediawiki SQL databases that they would like to be able to return or act on in SPARQL queries. Jheald (talk) 08:14, 29 September 2015 (UTC)

Apparently the package D2RQ can act as a middleware layer between queries running on a SPARQL query engine and a relational database -- it makes it possible for the RDB to appear as an external SERVICE that SPARQL sub-queries can be sent to, with the results then returned to the main query in the same way as a federated query. Jheald (talk) 14:16, 2 October 2015 (UTC)

Fast searching of descriptions


The request was made on Project Chat:

"Is there any (easy) way to search for some string in the description field of items only in one particular language?"

It would be nice to be able to do this in SPARQL, eg with a query like

PREFIX schema: <http://schema.org/>

SELECT ?q ?desc WHERE {
    ?q schema:description ?desc .
    FILTER (STRSTARTS(?desc, 'langue'))    
    filter (lang(?desc) = "fr") .
}
Try it!

or better

PREFIX schema: <http://schema.org/>

SELECT ?q ?desc WHERE {
    ?q schema:description ?desc .
    FILTER (STRSTARTS(LCASE(?desc), 'langue'))     # Could also use CONTAINS()
    filter (lang(?desc) = "fr") .
}
Try it!

I thought the SPARQL service was supposed to have fast indexing and range extraction on predicate values, so if the query optimizer can identify the above as range queries, it ought to be possible for them to be very fast.

However they time out. (By comparison, WDQ can do this). Jheald (talk) 09:40, 7 October 2015 (UTC)

Hide days and months in timeline view


Is it possible to hide the days and/or months in the timeline view? I'm generating timelines from data where only the year is known. The display shows each year as January 1, ____. I realise that HistropediaJS gives me much more control of date precisions, but it would be nice if there were a #hide-like option in the timeline view to only show the year. MartinPoulter (talk) 11:48, 3 October 2018 (UTC)

Having thought about this more, I realise it is possible: just hide the date field and create a separate field with just the year. No changes needed! MartinPoulter (talk) 12:14, 19 October 2018 (UTC)

Creating "is not null" filter


How do you specify a filter so that it doesn't return null results for that property? Thank you. Trilotat (talk) 03:09, 15 December 2018 (UTC)


Ability to export map


Being able to see a map in the UI is great, but it would be great to be able to download such a map without screenshotting and cropping out overlays. Svízel přítula (talk) 11:00, 13 September 2020 (UTC)