Wikidata:SPARQL query service/suggestions

A page for suggestions, and for reports of strange behaviour

See also: All open tickets for WDQS

Meta


Feedback

Action: this page now created
  • It would probably be good to set up a permanent suggestions page on this wiki -- much better than a Google form, because then we can read and think about what everybody else is saying. Jheald (talk) 10:09, 13 September 2015 (UTC)

Queries

Action: Wikidata:SPARQL_query_service/queries now created

Wishlists


Sourcing analysis

  • Querying sources/references is the biggest benefit of WQS, imho. Including strings/reference URL (P:P854) (Wikidata does have Special:LinkSearch, but...) If I want to consider using Wikidata property values in Wikipedia infoboxes or lists, I want to know: Does the data have references/sources? Are 99% of those references the dreaded "imported from wikipedia" type? How many have "real" references and what are they?

--Atlasowa (talk) 08:32, 9 September 2015 (UTC)

Output


PagePile

https://tools.wmflabs.org/pagepile/ Good one, thanks. --Atlasowa (talk) 08:39, 9 September 2015 (UTC)
  Done Pagepile's "create a new pile" page now includes an input box into which a SPARQL query can be cut-and-pasted. Magnus's tweet announcing it. Jheald (talk) 10:18, 20 September 2015 (UTC)

URL shortener


URL shortener when queries time out


Currently the URL shortener button is not presented if a query times out.

One can generate a short URL by cutting-and-pasting from the browser address bar to a URL shortener, but it would be nice if the button were there, in the same way that it is there for successful queries. Jheald (talk) 21:58, 29 September 2015 (UTC)

  Done --Smalyshev (WMF) (talk) 23:17, 7 March 2016 (UTC)

Font choice and size


Formatting

  • One change I would like to see is in the output formatting of the results. It would be nice if I could simply copy-and-paste results into a text editor, and have them in nice tab-separated columns. Instead there's an odd mixture with carriage returns and additional stars that then needs clean-up with a text editor.
  Done This has been done; there is now a download option box, for results to be downloaded in CSV, TSV or JSON. Jheald (talk) 10:20, 20 September 2015 (UTC)
Another nice addition to the output would be Reasonator links, for items and dates. Jheald (talk) 10:09, 13 September 2015 (UTC)
+ links opening by default in a new tab. --- Jura 10:23, 13 September 2015 (UTC)

Query interface

  • develop a more user-friendly interface for making queries. I am now able to modify the queries that were given to me, to change a year or language link, but I don't understand how it queries references, or how to modify the result display. :/ --Hsarrazin (talk) 19:25, 9 September 2015 (UTC)
For example see "Boolean Graphical Search Learning Tool" Boolify (mobile Boolify, How-to video). --Atlasowa (talk) 09:42, 2 February 2016 (UTC)

Graphical output


http://demo.seco.tkk.fi/visu/ can produce some rather nice graphical output for a SPARQL query -- including frequency charts, and Google geographical plots. It would be nice to have similar output; or alternatively a button to send a query to run on http://demo.seco.tkk.fi/visu/, which knows how to submit queries to the Wikidata WDQS server, and interpret the results. Jheald (talk) 22:02, 29 September 2015 (UTC) (+github link) (+example query of DBpedia)

Link to the 'explain' report

For trying to understand query performance and the query optimisation path taken, it's useful to be able to look at the 'explain' report for the query. (See Wikidata:SPARQL_query_service/query optimization).

This can be accessed by replacing the following at the start of the query-editor URL,

https://query.wikidata.org/#

with

https://query.wikidata.org/bigdata/namespace/wdq/sparql?explain&query=

-- which is easy enough, if this page is available to cut and paste from. But it would be even easier if there were a button or link to take users to this report. Jheald (talk) 10:50, 2 October 2015 (UTC)

  Support -- it should be in the "timeout" message directly. d1g (talk) 07:40, 16 May 2017 (UTC)
  Support --MarioGom (talk) 13:14, 5 March 2020 (UTC)

Query limits


It would be good, as a community, to try to map out the limits of what is currently possible with the query service -- in particular, for example, how far it is possible to broaden narrow queries before they time out.

In some cases, there may then be ways to re-write the queries, or change their query optimization to allow broader queries to be run, pushing the limits further. (In particular, if query performance suddenly "falls off a cliff" as the query is run for broader groups, as per the example presented on the optimization page, this can be a sign that the built-in optimization strategy has changed, which can sometimes be overcome by specifying a hand-ordering for the query).

In other cases, searching for the limits may reveal strange performance issues, such as some of those listed immediately below.

But it would be good to get an idea, generally, of what is and what is not possible, with the current set-up.

Performance surprises


"It would be great for the community to catalogue the queries that are most important for their use cases that do not do well on the SPARQL endpoint."

-- Mailing list comment from Benjamin Good, [1]

As a preliminary point, I think the same basics as with most other query engines (SQL, etc.) apply:

- non-restrictive queries with tons of results will be slow - i.e. "list of all humans" is probably not a good question to ask :)

- negative searches are usually slower - e.g. "all humans without images" will be slow, since that query would have to inspect records for every human

- unbound paths/traversals will usually be slower (unfortunately, many queries that use TREE in WDQ fall into this category), especially if there are a lot of starting points for traversals (again, "all humans that", etc...)

It is also a good idea to put a LIMIT on queries when experimenting: if you intended to write a query that asks for 10 records but accidentally wrote one that returns 10 million, it's much nicer to discover that with a suitable limit than to wait for the query to time out and then try to figure out why it happened.

Yes, I realize all this has to go to some page in the manual eventually :)
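A minimal sketch of these points, assuming the standard WDQS prefixes (the property and item choices here are only illustrations):

```sparql
# A negative search: all humans (Q5) without an image (P18).
# This must inspect every instance of human, so it is slow --
# keep a LIMIT on while experimenting.
SELECT ?human WHERE {
  ?human wdt:P31 wd:Q5 .                        # instance of: human
  FILTER NOT EXISTS { ?human wdt:P18 ?image }   # no image statement
}
LIMIT 100
```

Removing the LIMIT turns this into exactly the kind of broad negative query described above.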

Query optimizer issues


Blazegraph has a built-in query optimizer which attempts to find the most efficient way to execute a given query from amongst the different possible execution plans.

This often works well, but sometimes it can make the wrong choices. In such cases it may be possible to do better by ordering a query by hand. Valuable information on the optimisation and execution performance of a query can often be obtained from the query engine's 'explain' mode, as described on the query optimization page.

Low initial cardinality can lead to false trails


A basic approach in query optimisation is to try to get the solution set to be as small as possible as soon as possible in the query execution, to make the work needed for subsequent joins as small as it can be.

The query optimiser therefore seems typically to re-order a query to start with the most specific assertion first, ie the one with the smallest solution set, and then follow the trail from there. Very often this is exactly the right strategy. However in some cases subsequent joins then greatly multiply the size of the solution set (rather than tending to reduce it), and starting from a different initial assertion may be better. The query optimization page gives an example.
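Where hand-ordering is needed, Blazegraph's query-hints vocabulary can switch the optimizer off, so that clauses execute in the order written. A sketch (the triple patterns are illustrative; wd:Q5582 is taken here to be Vincent van Gogh):

```sparql
PREFIX hint: <http://www.bigdata.com/queryHints#>

SELECT ?painting WHERE {
  hint:Query hint:optimizer "None" .   # execute clauses in written order
  ?painting wdt:P170 wd:Q5582 .        # creator: van Gogh -- small set first
  ?painting wdt:P31 wd:Q3305213 .      # then confirm instance of: painting
}
```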

Queries involving sub-queries


Where queries contain sub-queries (ie SELECT clauses inside SELECT clauses), it appears that by default the query optimizer will try to start the sub-query from an assertion that contains one of the variables that will be projected from the inner query to the outer query.

This appears to be the case even if there are other assertions in the sub-query that might have much lower cardinality; and even if the optimiser might optimise the sub-query quite differently if it were standing on its own.
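In schematic form (the property and item choices are illustrative), the pattern looks like this; the point reported above is that the optimizer anchors the inner query on a clause containing the projected variable, here ?painting, even when another inner clause has far lower cardinality:

```sparql
SELECT ?painting ?location WHERE {
  {
    # Inner query: projects only ?painting to the outer query.
    SELECT ?painting WHERE {
      ?painting wdt:P31 wd:Q3305213 .   # instance of: painting (large set)
      ?painting wdt:P170 wd:Q5582 .     # creator: van Gogh (much smaller set)
    }
  }
  ?painting wdt:P276 ?location .        # location of each painting found
}
```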

Query optimiser defeated by an optional BIND statement


This query tinyurl.com/nr5e7ch runs in 368 ms.

However, if one tries to record which branch of the first UNION has been taken, tinyurl.com/ntenxgw, then the query times out. Jheald (talk) 16:47, 26 September 2015 (UTC)

Update Query fixed, by changing the order of clauses in the WHERE block: tinyurl.com/puuw4hv. It's not clear why the query optimiser wasn't able to do this itself when the BIND statements were there -- but it might be because BIND statements can sometimes change the results of subsequent clauses. Jheald (talk) 10:48, 29 September 2015 (UTC)

There was a previous issue flagged to Blazegraph about optimising around UNIONs (BLZG-1297), in the absence of any BIND statements. That appears now to have been fixed (or at least query tinyurl.com/ob527ow now appears to run without any difficulty, though the bug report at Blazegraph doesn't appear to have been formally closed). But it seems the fix does not hold if BIND statements are present. Jheald (talk) 15:49, 2 October 2015 (UTC)

Path assertions that end in a variable can blow up


Such path searches fail even if the variable is a constant that has been set with a BIND statement


This query runs in about 0.3 seconds:

SELECT (COUNT(?item) AS ?count) WHERE {
   ?item wdt:P31/wdt:P279* wd:Q3305213 . # paintings
}
Try it!

However, if BIND is used, the equivalent query times out:

SELECT (COUNT(DISTINCT ?a) AS ?count) WHERE {
    BIND (wd:Q3305213 AS ?class) .
    ?a wdt:P31/wdt:P279* ?class .
}
Try it!

Other anomalies


In other cases too, small changes in the detailed specification of path searches can make considerable differences -- compare the before and after versions in this diff, which on the face of it should be entirely equivalent.

General performance of path queries


Even in very simple graphs, the path query takes longer:

The following query asks for all male first names 'said to be the same as' Charles:

  • tinyurl.com/qeua4h4

Allowing 3 hops, it finds 14 names (the same as with 2 hops) in 0.289 seconds.

However, using the path search instead takes 0.507 seconds -- almost twice as long.

  • tinyurl.com/pmcrqn2

Can anything be done to make these path queries quicker -- especially when combining wdt:P31/wdt:P279* results with other conditions? It is very disappointing, especially when considering (a) how good Magnus's WDQ is at queries of this type; and (b) how pervasive an idea it is to "refine subclasses" to group items together more precisely, making searches involving wdt:P31/wdt:P279* essential to retrieve the full umbrella set.

As reported earlier at Project Chat, even simple counts can be far slower in the SPARQL service than with WDQ. For example, here's a query to count items in the sub-class tree of occurrence (Q1190554), using WDQ and using the new WQS beta. The WQS is over 30 times slower, taking 2164 ms, compared to 63 ms for WDQ. Jheald (talk) 18:37, 24 September 2015 (UTC)

OPTIONAL clause makes things much slower


The following query for isotopes and their decay modes is fast: tinyurl.com/gpe57wz - 700 results in 173 ms. However, making the decay mode piece optional: tinyurl.com/hs8ld7n fails with a limit of 700, and takes 9582 ms for 500 results. Any explanations? Searching without the decay modes piece of the query at all is pretty speedy too, even with no limit on results: tinyurl.com/h4mbac9 - 4736 results in 283 ms. Something about OPTIONAL is causing trouble here, I think? ArthurPSmith (talk) 18:34, 16 February 2016 (UTC)

Note that replacing the OPTIONAL clause with a UNION makes it much faster again. ArthurPSmith (talk) 20:33, 16 February 2016 (UTC)
The first thing to do is to check the output of the 'explain' option for the query, as explained at Wikidata:SPARQL query service/query optimization, to see what sequence of commands the engine is actually executing in each case, after the query optimizer has re-arranged it.
My suspicion is that without the "optional" clause, it may be that the system is looking up the decays to (P816) statement first, which may be (a) quicker and (b) give a smaller initial solution set, than looking up wdt:P31/wdt:P279* wd:Q25276 first.
But I could be completely wrong. Jheald (talk) 23:31, 16 February 2016 (UTC)
Note that this is now fixed (not a problem at all) with the current Wikidata Query Service. ArthurPSmith (talk) 15:02, 11 March 2016 (UTC)

Example queries that don't work any more


Other strange behaviour


MINUS strangeness


In this query tinyurl.com/nqawlxp the query service appears to be ignoring the keyword MINUS.

The query is supposed to return co-ordinates for places in Paris which do not have more than one specified set of co-ordinates. Instead, the query is only returning places which have more than one set.

It's easy enough to work round (eg by making a positive clause tinyurl.com/otlzwdx), but I don't understand why my original query is being interpreted this way. Jheald (talk) 11:34, 29 September 2015 (UTC)
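One pattern worth checking in such cases (a general observation about SPARQL negation, not a confirmed diagnosis of the query above): MINUS only removes solutions whose bindings are compatible with the negated group, whereas FILTER NOT EXISTS re-evaluates its group against each outer solution, which makes it the more predictable idiom. A sketch:

```sparql
SELECT ?place ?coord WHERE {
  ?place wdt:P131 wd:Q90 .        # located in Paris
  ?place wdt:P625 ?coord .        # coordinate location
  FILTER NOT EXISTS {             # exclude places that also have a
    ?place wdt:P625 ?coord2 .     # second, different set of coordinates
    FILTER (?coord2 != ?coord)
  }
}
```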

Geographical query can throw a "ConcurrentModificationException" error


As described in the Wikidata:SPARQL_query_service/queries#Working_with_co-ordinates section, certain broad searches on co-ordinates seem to be able to throw a "ConcurrentModificationException" error. Jheald (talk) 12:43, 5 October 2015 (UTC)

See also Wikidata:SPARQL_query_service/query_limits#Search_within_box for more occurrence data.

Variables bound from GROUP_CONCAT in inner SELECT not visible to outer SELECT


A list of values gathered together by a (GROUP_CONCAT(DISTINCT(?rkd); separator = "; ") AS ?rkd_list) in an inner SELECT statement appears not to be visible to the outer SELECT. See bug report for a full example. Jheald (talk) 12:00, 9 October 2015 (UTC)
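The pattern being reported, in schematic form (the properties are illustrative; per the report, ?rkd_list comes back unbound in the outer SELECT even though the inner SELECT projects it):

```sparql
SELECT ?item ?rkd_list WHERE {
  {
    # Inner query: gather one concatenated list per item.
    SELECT ?item (GROUP_CONCAT(DISTINCT(?rkd); separator = "; ") AS ?rkd_list)
    WHERE {
      ?item wdt:P31 wd:Q3305213 .   # paintings (illustrative class)
      ?item wdt:P350 ?rkd .         # some identifier values to concatenate
    }
    GROUP BY ?item
  }
}
```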

ORDER BY doesn't work well with items and null values

SELECT ?place ?placeLabel ?event ?eventLabel ?ref ?refp ?refv
WHERE
{
    wd:Q28922083    p:P17            ?s     .
      ?s           ps:P17            ?place .
      ?s           pq:P793           ?event .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
} ORDER BY DESC(?event)
Try it!

Weird sorting (wd:Q38087, wd:Q29278, ... , wd:Q86591, ...) persists with the following modifications:

label service returns t385555272 t385555269 for missing values


Same query as in section above (ORDER BY doesn't work well) d1g (talk) 11:41, 11 March 2017 (UTC)

Unclosed link elements in output

When I access the query service using a tool I downloaded, the tool complains about unclosed link elements. Looking at the response, I do see unclosed link elements. Although many tools accept unclosed link elements, shouldn't the query service be ultra-compliant? Peter F. Patel-Schneider (talk) 03:42, 8 May 2018 (UTC)

Feature requests


Label service should not fail when given a blank to find the label for

  • eg for a field from an OPTIONAL clause, that may only be filled in sometimes.
(Example from the mailing list: https://lists.wikimedia.org/pipermail/wikidata/2015-September/007082.html).
Some comments by User:Smalyshev (WMF) on the label service: [5]
The issue can be worked around with code like
OPTIONAL { 
     ?item rdfs:label ?item_label filter (lang(?item_label) = "en"). 
}

Label service for properties in the wdt: namespace


It would be nice if the label service could also find labels for properties in the wdt: namespace.

The issue can be worked round by adding a line like

    ?prop wikibase:claim ?p

to join the corresponding entity ?prop in the wd: namespace into the query, and the label service can then generate a label ?propLabel.

However, as described on the query optimization page, the query optimizer finds ?prop wikibase:claim ?p an unfortunately seductive starting-point for queries. The statement has an initial solution set of only about 2000 entries, but starting from here the query will (usually) then blow up when the engine tries to attach the next statement, typically causing the query to time out. To avoid this requires that the optimizer is turned off and the query be explicitly hand-ordered.

Turning off the optimizer is all very well, but it increases the level of knowledge required, and there are some queries that are not easy to hand-order well without the built-in optimizer. Jheald (talk) 08:11, 16 September 2015 (UTC)
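Putting the workaround and the hand-ordering caveat together, a sketch (wd:Q5582, taken here to be Vincent van Gogh, is illustrative; wikibase:directClaim is used as the mapping for wdt:-namespace predicates, where wikibase:claim maps the p: namespace):

```sparql
PREFIX hint: <http://www.bigdata.com/queryHints#>

SELECT ?p ?propLabel WHERE {
  hint:Query hint:optimizer "None" .     # hand-ordered, as described above
  wd:Q5582 ?p ?value .                   # all predicates on one item (illustrative)
  ?prop wikibase:directClaim ?p .        # join the wd: entity for each wdt: predicate
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
```

The label service can then generate ?propLabel from the joined ?prop entity.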

Sorting sequence for entities in the wd: namespace


@Smalyshev (WMF): Is there a sorting order for entities in the SPARQL service? Could you add one?

For example, consider the following query:

tinyurl.com/nwc47gj

In the query ?fname is a firstname, like Dave (Q16275316), and ?altname is an alternate name said to be the same as (P460) as the firstname, eg David (Q18057751).

I would like to be able to canonicalise firstnames to find the firstname with the lowest Q-number, (to be called ?name) -- eg to count the number of 'families' of firstnames, or count each family's total incidence.

However, it seems that although I can find a sample alternative name (using SAMPLE), I can't reliably create a canonical one (using MIN) -- the query engine throws a Vocab(2) exception. I presume this means that no sequence order has been declared for wikibase:Item.

Is this right? and/or is there a good way round this? Jheald (talk) 14:39, 11 September 2015 (UTC)

With some help from User:Jura1 I created this tinyurl.com/n9pfqno to count the number of variants in a group of similar first-names, which works by canonicalising to a number, of which one can then find the MIN, and then in the outer loop finding the item that maps to that number.
But it still seems that it would be useful to be able to calculate the MIN of a group of items directly. Jheald (talk) 22:24, 16 September 2015 (UTC)
This is tricky. In general, there's no order on RDF entities, so MIN is not well-defined, unless we were to compare lexicographically as strings. I'll research the existing practices in this area. --Smalyshev (WMF) (talk) 19:10, 22 September 2015 (UTC)
Attaching pageview statistics to query results

One thing that would be nice, perhaps in the longer term, would be to be able to attach pageview statistics from the corresponding Wikipedia articles to the results of queries -- making it possible to give answers to questions like "What were the top 20 battles fought in the UK, by pageviews", or "top 10 U.S. presidents by pageviews", or "who was born on this day tinyurl.com/p6ry6y7 - by pageviews" or whatever. I believe Histropedia already does something like this, to decide on the most likely-to-be-interesting top n events to put on its timelines -- so the timelines show the most high-profile events at the widest zoom, then more and more "detail" events as one zooms in. It would be nice (eventually) to make this sort of popularity-based choice also available for ranking for general queries. Jheald (talk) 19:13, 24 September 2015 (UTC)

Access to other Mediawiki databases from the SPARQL service


There are other WMF databases, beyond the page-hits database, that it would be nice to be able to gateway into the SPARQL service, ie so that they could be made visible (using FROM ?) for a query if desired. (I have read that SPARQL is in fact quite often used to interrogate conventional relational databases, and that quite a lot of RDBMS <-> SPARQL bridging software exists, because the flat table-of-triples approach of SPARQL is sometimes found the easiest way to bring together information from different inherited relational databases with very different table designs).

If SPARQL could access the main mediawiki table-replicates on Labs, one might then be able to write queries that eg for a particular Q-number for a Wikimedia category could find the Q-numbers for all the parent categories across every language (or a choice of languages), and then identify whether the subcategory relationship could be "explained" by a short chain of properties connecting the corresponding category's main topic (P301) items. Of course this could be done by a tool interrogating the different databases separately, but it would be nice if it could all be done from within a single query in SPARQL.

I am sure that others could add other examples of information in the mediawiki SQL databases that they would like to be able to return or act on in SPARQL queries. Jheald (talk) 08:14, 29 September 2015 (UTC)

Apparently the package D2RQ can act as a middleware layer between queries running on a SPARQL query engine and a relational database -- it makes it possible for the RDB to appear as an external SERVICE that SPARQL sub-queries can be sent to, with the results then returned to the main query in the same way as a federated query. Jheald (talk) 14:16, 2 October 2015 (UTC)

Fast searching of descriptions


The request was made on Project Chat:

"Is there any (easy) way to search for some string in the description field of items only in one particular language?"

It would be nice to be able to do this in SPARQL, eg with a query like

PREFIX schema: <http://schema.org/>

SELECT ?q ?desc WHERE {
    ?q schema:description ?desc .
    FILTER (STRSTARTS(?desc, 'langue'))    
    filter (lang(?desc) = "fr") .
}
Try it!

or better

PREFIX schema: <http://schema.org/>

SELECT ?q ?desc WHERE {
    ?q schema:description ?desc .
    FILTER (STRSTARTS(LCASE(?desc), 'langue'))     # Could also use CONTAINS()
    filter (lang(?desc) = "fr") .
}
Try it!

I thought the SPARQL service was supposed to have fast indexing and range extraction on predicate values, so if the query optimizer can identify the above as range queries, it ought to be possible for them to be very fast.

However they time out. (By comparison, WDQ can do this). Jheald (talk) 09:40, 7 October 2015 (UTC)

Hide days and months in timeline view


Is it possible to hide the days and/or months in the timeline view? I'm generating timelines from data where only the year is known. The display shows each year as January 1, ____. I realise that HistropediaJS gives me much more control of date precisions, but it would be nice if there were a #hide-like option in the timeline view to only show the year. MartinPoulter (talk) 11:48, 3 October 2018 (UTC)

Having thought about this more, I realise it is possible: just hide the date field and create a separate field with just the year. No changes needed! MartinPoulter (talk) 12:14, 19 October 2018 (UTC)

Creating "is not null" filter


How do you specify a filter so that it doesn't return null results for that property? Thank you. Trilotat (talk) 03:09, 15 December 2018 (UTC)


Ability to export map


Being able to see a map in the UI is great, but it would be great to be able to download such a map without screenshotting and cropping out overlays. Svízel přítula (talk) 11:00, 13 September 2020 (UTC)