ウィキデータ:SPARQLチュートリアル

This page is a translated version of the page Wikidata:SPARQL tutorial and the translation is 44% complete.
Outdated translations are marked like this.
Other languages:
Bahasa Indonesia • ‎Deutsch • ‎English • ‎Türkçe • ‎català • ‎dansk • ‎eesti • ‎español • ‎français • ‎italiano • ‎polski • ‎português do Brasil • ‎svenska • ‎русский • ‎українська • ‎հայերեն • ‎العربية • ‎日本語

WDQSことWikidata Query Serviceは、あなたの多くの疑問に答えてくれる素晴らしいツールです。このガイドではWDQSの使い方を説明します。

はじめに

このガイドは大変長いため、少し恐れを感じてしまうかもしれません。でも逃げ出さなくても大丈夫です! SPARQLは複雑ですが、シンプルな基本だけでも役立ちます。ご希望であれば、#Our first queryまでで読むのを止めてもいいですし、そこまででも多くの興味深いクエリーを書けるようになります。その後には、異なるクエリーを書けるようになるためのさらなるトピックがあります。それらはもっと素晴らしいクエリーを書けるようになるはずですが、必須というわけではありません。どこでも読むのを止めても構いませんし、そこまででもたくさん役立つ知識を得られるはずです。

また、もしWikidata, SPARQL, や WDQS のことを全く聞いたことがなければ、これらについての簡単な説明はこちらになります。

  • WDQSことWikidata Query Serviceは、2つの要素からなります: SPARQLクエリーを入力すると、Wikidataのデータセットに対して検索を行い、その結果を表示します。

SPARQLの基本

シンプルなSPARQLクエリーは以下のようになります:

SELECT ?a ?b ?c
WHERE
{
  x y ?a.
  m n ?b.
  ?b f ?c.
}

SELECT節にはクエリの結果として得たい変数を列挙します(変数はクエスチョン・マークで始まります)。WHERE節はそれらの変数に対する制約からなり、ほとんどの場合はトリプルの形をとります。Wikidata(および類似の知識データベース)のすべての情報は、トリプルの形で保存されています。クエリが実行されると、クエリサービスは得たいトリプルの制約を満たすような変数の値を知識データベース上に見つけようと試みます。そしてそのような変数の組み合わせが見つかるごとに、それをひとつの結果として返します。

トリプルは文のように読めます(ピリオドで終わるのはそのためです),。それぞれ主語述語目的語になります:

SELECT ?fruit
WHERE
{
  ?fruit hasColor yellow.
  ?fruit tastes sour.
}

このクエリーへの回答には、例えば“lemon”が含まれます。Wikidataでは、多くのプロパティは“has”の性質を持つプロパティになっています。そのため次のクエリーは以下のように読めます:

SELECT ?fruit
WHERE
{
  ?fruit color yellow.
  ?fruit taste sour.
}

これは?fruit has color ‘yellow’ と読めます(?fruit is the color of ‘yellow’ ではありません — parent/child のような対になるプロパティのためにこれは心に留めておいてください!)。

しかしながら、WDQSの説明のためには、これはあまりいい例ではありませんでした。taste(〜な味がする)は主観的なので、Wikidataにはこれに対応するプロパティがないのです。代わりに、たいていの場合においては曖昧さのない、parent/child(親/子)の関係について考えてみましょう。

初めてのクエリ

バロック音楽の作曲家、ヨハン・ゼバスティアン・バッハの、すべての子供を列挙したいものと考えてみてください。上で見たような、擬似クエリを要素に使うとして、あなたならどんなクエリを書きますか?

こんな感じのクエリを書いたのではないでしょうか。

SELECT ?child
WHERE
{
  #  child "has parent" Bach
  ?child parent Bach.
  # (注:‘#’ より後ろはすべてはコメントで、WDQSには無視されます。)
}

あるいはこんなのや、

SELECT ?child
WHERE
{
  # child "has father" Bach 
  ?child father Bach. 
}

こんなのかもしれませんね。

SELECT ?child
WHERE
{
  #  Bach "has child" child
  Bach child ?child.
}

最初のふたつのトリプルは、?childは parent/father に Bach を持っていなければならない、と言っています。3つめのは、Bach はその子供に?childを持っていなければならない、と言っています。ここでは2つめのを採用することにしましょう。

では、これを適切なWDQSクエリに変換するためには、あと何が必要でしょうか。ウィキデータでは、項目やプロパティを特定するために使われているのは、例えば「父親」(プロパティ)とか、「バッハ」(項目)とかいった、人間にわかりやすい名前ではありません。(これは無理もないことなのです。「ヨハン・ゼバスティアン・バッハ」はドイツの画家の名前でもありますし、「バッハ」は人物のであるだけでなく、フランスの自治体や、火星のクレーターなどの名前でもあり、これらを指しているかもしれないからです。) 代わりに、ウィキデータの項目とプロパティには、識別子が割り当てられています。ある項目の識別子を見つけるには、項目を検索して、(例えば説明文などから)それらしいものが見つかったら、その Q ナンバーをコピーします。プロパティの識別子を見つけるのにも同じことを行いますが、単に検索窓に探している言葉を入力するのではなく、「P:探している言葉」と入力すれば、プロパティに限定して検索を行います。この検索の結果として、あの有名な作曲家のヨハン・ゼバスティアン・バッハは Q1339 であり、そしてある人物の父親を指定するプロパティは P:P22 であることがわかります。

最後に忘れてはいけないのが、接頭辞をつけることです。単純なWDQSトリプルでは、項目にはwd:を、そしてプロパティにはwdt:を、その頭につけなければいけません。(ただしこれが必要なのは値が特定され固定されている場合だけです。変数は接頭辞をとりません。)

これをまとめると、初めての正しいWDQSクエリが出来上がります。

SELECT ?child
WHERE
{
# ?child  father   Bach
  ?child wdt:P22 wd:Q1339.
}

Try it!

「試す!」のリンクをクリックすると、WDQSのページでクエリが実行されます。結果はどうなりましたか?

child
wd:Q57225
wd:Q76428

うーん、これはちょっとがっかりですね。識別子しかわかりません。識別子をクリックすれば該当のWikidataのページを見ることはできるのですが(そこには人にもわかりやすいラベルがあります)、検索結果を見るのにもうちょっといい方法はないのでしょうか?

はい、実はこれがあるんですね!(修辞疑問文って素敵だと思いませんか?)次の魔法の一文を WHERE 節のどこかに追加すると、

SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }

するとクエリ中のすべての変数に対して、自動的にひとつの変数が追加されます。例えば?fooという変数には?fooLabelという変数が追加され、この変数には?fooを満たす項目のラベルの値が保持されます。この変数をSELECT節に追加すれば、項目とともにそのラベルも得ることができます。

SELECT ?child ?childLabel
WHERE
{
# ?child  father   Bach
  ?child wdt:P22 wd:Q1339.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}

Try it!

このクエリを実行してみてください。すると項目番号だけでなく、色々と子供の名前が表示されるはずです。

child childLabel
wd:Q57225 Johann Christoph Friedrich Bach
wd:Q76428 Carl Philipp Emanuel Bach

自動補完

でもこのSERVICEから始まるコードを覚えとくのはしんどそうですよね? それにクエリを書くたびに検索を繰り返すのもうんざりですよね。幸いなことに、WDQSにはこれに対する大いなる解決策、つまり「自動補完」が用意されています。クエリエディタ query.wikidata.org でクエリを書いている途中に、どこででもいいので Ctrl+Space を押すと、そこで入力するのにふさわしいコードの候補が表示されます。そこで適当な候補を up/down の矢印キーで選んでから、Enter キーを押して確定してください。

たとえば、SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }と毎回毎回入力する代わりに、SERVとだけ入力して、Ctrl+Space キーを押せばいいんです。すると一番上の候補に、ラベル用のこの魔法の一文がバッチリ表示されているものと思います。準備は整いました!あとは Enter キーを押してこれを確定させるだけです。(入力されたコードのフォーマットは少しばかり違っているかもしれませんが、問題ありません。)

自動補完では検索も可能です。wd:wdt:のようなWikidataの節頭辞を入力したあとに、続けて何か文字を入力し、Ctrl+Space キーを押すと、Wikidataをこの文字で検索し、検索結果を候補として表示します。wd:なら項目を検索しますし、wdt:ならプロパティを検索します。たとえば、Johann Sebastian Bach (Q1339)father (P22)を検索して探す代わりに、wd:Bachwdt:fathとだけ入力して、表示された候補の中から正しいものを選ぶだけでいいのです。(この機能は、wd:Johann Sebastian Bachのようにテキスト中にスペースがある場合でも有効です。)

より進んだトリプルのパターン

そういうわけで、いまはヨハン・ゼバスティアン・バッハのすべての子供をみたところでした。より具体的にいうと、父親としてヨハン・ゼバスティアン・バッハを持つすべての項目を得たところです。しかしバッハには妻が2人いましたので、これらの項目は2つの異なる母親を持っています。もしヨハン・ゼバスティアン・バッハの、最初の妻であるマリア・バルバラ・バッハ (Q57487) との間の子供だけをみたい場合は、どうしたらいいでしょうか? 上で書いたクエリをもとに、このクエリを書いてみましょう。

できましたか? オッケー、では解答をみていくことにしましょう! いちばんシンプルな方法は、制約に2つめのトリプルを追加することです。

SELECT ?child ?childLabel
WHERE
{
  ?child wdt:P22 wd:Q1339.
  ?child wdt:P25 wd:Q57487.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}

Try it!

英語に直すと、これは次のように読むことができます。

Child has father Johann Sebastian Bach.

Child has mother Maria Barbara Bach.

少しばかり不自然に聞こえますね。自然言語であれば、これは次のように略されるでしょう。

Child has father Johann Sebastian Bach and mother Maria Barbara Bach.

実は、SPARQLでも同様に、同じ省略を表現することが可能です。トリプルをピリオドの代わりにセミコロン (;) で終わると、別の述語 ー 目的語の組を追加することができます。これにより上のクエリを次のように略すことが可能です。

SELECT ?child ?childLabel
WHERE
{
  ?child wdt:P22 wd:Q1339;
         wdt:P25 wd:Q57487.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}

Try it!

結果は同じですが、クエリ中の繰り返しは減らすことができました。

では、ここまでの結果に加えて、作曲家でもありピアニストでもある子供にしか興味がないとしたらどうでしょうか。これに適したプロパティと項目は、occupation (P106)composer (Q36834)、そしてpianist (Q486748)です。

私の解答は次のようになります。

SELECT ?child ?childLabel
WHERE
{
  ?child wdt:P22 wd:Q1339;
         wdt:P25 wd:Q57487;
         wdt:P106 wd:Q36834;
         wdt:P106 wd:Q486748.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}

Try it!

ここでは;を使った省略が、2つの必要な職業を追加するために、 さらに2回使われています。しかし、お気づきかもしれませんが、ご覧のようにここにもまだ繰り返しがあります。このコードだとあたかもこう言っているかのようです。

Child has occupation composer and occupation pianist.

これを私たちは普通、こう略します。

Child has occupation composer and pianist.

SPARQLにはこのためのシンタックスも同様にあります。;が述語と目的語の組をトリプルに追加することを(主語を再利用することによって)可能にしたのと同じように、,を使うことによって、別の目的語をトリプルに追加することが(主語と述語の双方を再利用することによって)可能になります。これによりクエリは次のように省略することができます。

SELECT ?child ?childLabel
WHERE
{
  ?child wdt:P22 wd:Q1339;
         wdt:P25 wd:Q57487;
         wdt:P106 wd:Q36834,
                  wd:Q486748.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}

Try it!

注:インデントやその他の空白は、実のところ重要ではありません。クエリがより読みやすくなるように、インデントしただけです。上のクエリは次のように書くことも可能です。

SELECT ?child ?childLabel
WHERE
{
  ?child wdt:P22 wd:Q1339;
         wdt:P25 wd:Q57487;
         wdt:P106 wd:Q36834, wd:Q486748.
  # 両方の職業を1行で
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}

Try it!

あるいは、だいぶ読みにくくはなりますが、次のようなのも可能です。

SELECT ?child ?childLabel
WHERE
{
  ?child wdt:P22 wd:Q1339;
  wdt:P25 wd:Q57487;
  wdt:P106 wd:Q36834,
  wd:Q486748.
  # インデントなし。これは ; と , の識別が難しくなる
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}

Try it!

幸いにもWDQSエディタは自動的に行をインデントしてくれますので、通常はこれに頭を悩ませる必要はありません。

さあ、ではここでをまとめてみましょう。クエリは文章のように組み立てられることを見てきました。ある主語についてのトリプルは、それぞれがピリオドで終わります。同じ主語についての複数の述語は、セミコロンで区切られます。そして、同じ主語と述語に対する複数の目的語は、コンマで区切って列挙することが可能です。

SELECT ?s1 ?s2 ?s3
WHERE
{
  ?s1 p1 o1;
      p2 o2;
      p3 o31, o32, o33.
  ?s2 p4 o41, o42.
  ?s3 p5 o5;
      p6 o6.
}

ここでSPARQLで使用可能なもうひとつの省略形を紹介したいと思います。いまいちど仮定のお話におつきあい願うことになりますが…

本当はバッハの子供には関心はなかったとしましょう。(いや、これはひょっとすると、仮定の話ではなかったかもしれませんね!)子供ではなく、その「孫」に(仮定的に)興味があったのでした。ここで厄介な問題がひとつあります。孫は父親がバッハの子供なのかもしれないし、母親がバッハの子供なのかもしれない。どちらでもありうるのです。父親と母親は別々のプロパティなので、これは面倒です。代わりにぐるっと、関係を逆転させてしまいましょう。ウィキデータには「子」というプロパティもあって(P:P40 です)、これは親が主語になって親から子への関係性になるのと同時に、子供の性別を区別しないのです。これを念頭に、バッハの孫を返すクエリを書いてみましょう。書けましたか?

私の解答はこうです。

SELECT ?grandChild ?grandChildLabel
WHERE
{
  wd:Q1339 wdt:P40 ?child.
  ?child wdt:P40 ?grandChild.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}

Try it!

自然言語に直せば、これは次のように読むことができます。

Bach has a child ?child.

?child has a child ?grandChild.

ここでも、上の文章を短縮することを提案します。その上で、SPARQLではどうしたら同様の省略をすることができるかお見せしたいと思います。子供に関心がない場合には、実際にはどうするか考えてみてください。孫のことを話すときにしか変数を使わない、というのが答えです。そこで上の文は次のように略すことができたはずです。

Bach has as child someone who has a child ?grandChild.

ここではバッハの子供が誰かを言う代わりに、単に “someone” と言いました。それが誰であるかには関心がないからです。しかし someone 'who' と言っておくことで、それについて引き続き言及することができます。つまり関係節を作ることができ、その節の中で “someone” について何かをいうことができるということです。(たとえば、he or she “has a child ?grandChild” という風に。)ある意味で、 “someone” は変数なのです。ただしこの修飾節の中でのみ有効な特別な変数で、さらに、それに明示的に言及することはしない変数です(“someone who is this and does that” とは言いますが、“someone who is this and someone who does that” とは言いません。それだと someone は別々の違う「誰か」になるからです。)

SPARQLでは、これを次のように書くことができます。

SELECT ?grandChild ?grandChildLabel
WHERE
{
  wd:Q1339 wdt:P40 [ wdt:P40 ?grandChild ].
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}

Try it!

変数のあった場所に角括弧の組を使うことができ ([])、これが無名変数として機能します。角括弧の中では述語 - 目的語の組を指定することができ、これは通常のトリプルの後に;を置いた場合と同様です。しかしここでの暗黙的な主語は、角括弧が示すところの無名変数になります。(注:;の後とこれもまた同様に、セミコロンを置けばさらに述語 - 目的語の組を追加することができますし、コンマを置けば述語を共有する目的語をさらに追加することができます。)

これでトリプル・パターンについては終わりです! SPARQL についてはさらに続きますが、自然言語との高い類似性を持つパートを終えるに当たって、その関係性をもう一度まとめておきたいと思います。

自然言語 SPARQL
Juliet loves Romeo. ピリオド juliet loves romeo.
接続詞(節の) Romeo loves Juliet and kills himself. セミコロン romeo loves juliet; kills romeo.
接続詞(名詞の) Romeo kills Tybalt and himself. コンマ romeo kills tybalt, romeo.
関係節 Juliet loves someone who kills Tybalt. 角括弧 juliet loves [ kills tybalt ].


インスタンスとクラス

先に私は、ほとんどのWikidataプロパティは “has” 関係だといいました。has child(子供を持つ)、has father(父親を持つ)、has occupation(職業を持つ)、といったように。しかし時には(実際のところしばしば)、それがなんで「ある」か(what something is)を話す必要があります。しかし実はそこにも2種類の関係があるのです。

  • Gone with the Wind is a film.(『風と共に去りぬ』は映画である
  • A film is a work of art.(映画は芸術作品である

『風と共に去りぬ』は一つの特定の映画です。そこには特定の監督(ヴィクター・フレミング)がおり、具体的な上映時間(238分)があり、キャストメンバー表(クラーク・ゲーブル、ヴィヴィアン・リー、…)があり、…といった具合です。

「映画」は一般的な概念です。個々の映画は監督や上映時間やキャストを持つことができますが、「映画」という概念それ自体は、どんな特定の監督も上映時間もキャストも持ちません。そして映画は芸術作品であり(a film is a work of art)、芸術作品は通常その作者を持ちますが、「映画」という概念自体は作者を持ちません。この概念の特定の「実例(instance インスタンス)」のみが、それを持ちます。

この違いゆえに、Wikidataには “is”(〜である)を表すプロパティが2つ、つまりinstance of (P31)subclass of (P279)があるのです。『風と共に去りぬ』は「映画」というクラスの特定のインスタンスです。「映画」というクラスは、より一般的なクラスである「芸術作品」の下位クラス(より具体的なクラス。特殊化)です。

To help you to figure about the difference, you can try to use two different verbs: "is a" and " is a kind of". If "is a kind of" works (e.g. A film "is a kind of" work of art), it indicates that you are talking about a subclass, a specialization of a broader class and you should use subclass of (P279). If "is a kind of" does not work (e.g. the sentence Gone with the wind "is a kind of" film does not make sense), it indicates that you are talking about a particular instance and you should use instance of (P31).

So what does this mean for us when we’re writing SPARQL queries? When we want to search for “all works of art”, it’s not enough to search for all items that are directly instances of “work of art”:

SELECT ?work ?workLabel
WHERE
{
  ?work wdt:P31 wd:Q838948. # instance of work of art
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}

Try it!

As I’m writing this, that query only returns 2815 results – obviously, there are more works of art than that! The problem is that this misses items like Gone with the Wind, which is only an instance of “film”, not of “work of art”. “film” is a subclass of “work of art”, but we need to tell SPARQL to take that into account when searching.

One possible solution to this is the [] syntax we talked about: Gone with the Wind is an instance of some subclass of “work of art”. (For exercise, try writing that query!) But that still has problems:

  1. We’re no longer including items that are directly instances of work of art.
  2. We’re still missing items that are instances of some subclass of some other subclass of “work of art” – for example, Snow White and the Seven Dwarfs is an animated film, which is a film, which is a work of art. In this case, we need to follow two “subclass of” statements – but it might also be three, four, five, any number really.

The solution: ?item wdt:P31/wdt:P279* ?class. This means that there’s one “instance of” and then any number of “subclass of” statements between the item and the class.

SELECT ?work ?workLabel
WHERE
{
  ?work wdt:P31/wdt:P279* wd:Q838948. # instance of any subclass of work of art
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}

Try it!

(I don’t recommend running that query. WDQS can handle it (just barely), but your browser might crash when trying to display the results because there are so many of them.)

Now you know how to search for all works of art, or all buildings, or all human settlements: the magic incantation wdt:P31/wdt:P279*, along with the appropriate class. This uses some more SPARQL features that I haven’t explained yet, but quite honestly, this is almost the only relevant use of those features, so you don’t need to understand how it works in order to use WDQS effectively. If you want to know, I’ll explain it in a bit, but you can also just skip the next section and memorize or copy+paste wdt:P31/wdt:P279* from here when you need it.

Property paths

Property paths are a way to very tersely write down a path of properties between two items. The simplest path is just a single property, which forms an ordinary triple:

?item wdt:P31 ?class.

You can add path elements with a forward slash (/).

?item wdt:P31/wdt:P279/wdt:P279 ?class.

This is equivalent to either of the following:

?item wdt:P31 ?temp1.
?temp1 wdt:P279 ?temp2.
?temp2 wdt:P279 ?class.
?item wdt:P31 [ wdt:P279 [ wdt:P279 ?class ] ].

エクササイズ: 先程出てきた「バッハの孫」クエリーをこの構文を使って書き直す。

一つのアスタリスク (*) をパス要素の後につけることでゼロ回以上の要素を表します。

?item wdt:P31/wdt:P279* ?class.
# means:
?item wdt:P31 ?class
# or
?item wdt:P31/wdt:P279 ?class
# or
?item wdt:P31/wdt:P279/wdt:P279 ?class
# or
?item wdt:P31/wdt:P279/wdt:P279/wdt:P279 ?class
# or ...

If there are no other elements in the path, ?a something* ?b means that ?b might also just be ?a directly, with no path elements between them at all.

A plus (+) is similar to an asterisk, but means “one or more of this element”. The following query finds all descendants of Bach:

SELECT ?descendant ?descendantLabel
WHERE
{
  wd:Q1339 wdt:P40+ ?descendant.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}

Try it!

If we used an asterisk instead of a plus here, the query results would include Bach himself.

A question mark (?) is similar to an asterisk or a plus, but means “zero or one of this element”.

You can separate path elements with a vertical bar (|) instead of a forward slash; this means “either-or”: the path might use either of those properties. (But not combined – an either-or path segment always matches a path of length one.)

You can also group path elements with parentheses (()), and freely combine all these syntax elements (/|*+?). This means that another way to find all descendants of Bach is:

SELECT ?descendant ?descendantLabel
WHERE
{
  ?descendant (wdt:P22|wdt:P25)+ wd:Q1339.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}

Try it!

Instead of using the “child” property to go from Bach to his descendants, we use the “father” and “mother” properties to go from the descendants to Bach. The path might include two mothers and one father, or four fathers, or father-mother-mother-father, or any other combination. (Though, of course, Bach can’t be the mother of someone, so the last element will always be father.)

修飾子

(Good news first: this section introduces no additional SPARQL syntax – yay! Take a quick breath and relax, this should be a piece of cake. Right?)

今までのところは、主語・プロパティ・目的語といった、単純な文についてのみ話してきました。しかし、ウィキデータの文にはそれだけではなく、修飾子 (qualifier) と 出典 (references) が含まれ得ます。例えば、モナリザ (Q12418) には、material used (P186)の文が3つあります。

  1. oil paint (Q296955)は、主な材料です。
  2. poplar wood (Q291034)には、 applies to part (P518) painting surface (Q861259)の修飾子がついています。これは、モナリザが描かれた面の材質です。
  3. wood (Q287)には、 applies to part (P518) stretcher (Q1737943)start time (P580) 1951の修飾子がついています。これは、後に付け足された材料です。

Suppose we want to find all paintings with their painting surface, that is, those material used (P186) statements with a qualifier applies to part (P518) painting surface (Q861259). How do we do that? That’s more information than can be represented in a single triple.

The answer is: more triples! (Rule of thumb: Wikidata’s solution for almost everything is “more items”, and the corresponding WDQS rule is “more triples”. References, numeric precision, values with units, geocoordinates, etc., all of which we’re skipping here, also work like this.) So far, we’ve used the wdt: prefix for our statement triples, which points directly to the object of the statement. But there’s also another prefix: p:, which points not to the object, but to a statement node. This node then is the subject of other triples: the prefix ps: (for property statement) points to the statement object, the prefix pq: (property qualifier) to qualifiers, and prov:wasDerivedFrom points to reference nodes (which we’ll ignore for now).

That was a lot of abstract text. Here’s a concrete example for the Mona Lisa:

wd:Q12418 p:P186 ?statement1.    # Mona Lisa: material used: ?statement1
?statement1 ps:P186 wd:Q296955.  # value: oil paint

wd:Q12418 p:P186 ?statement2.    # Mona Lisa: material used: ?statement2
?statement2 ps:P186 wd:Q291034.  # value: poplar wood
?statement2 pq:P518 wd:Q861259.  # qualifier: applies to part: painting surface

wd:Q12418 p:P186 ?statement3.    # Mona Lisa: material used: ?statement3
?statement3 ps:P186 wd:Q287.     # value: wood
?statement3 pq:P518 wd:Q1737943. # qualifier: applies to part: stretcher bar
?statement3 pq:P580 1951.        # qualifier: start time: 1951 (pseudo-syntax)

We can abbreviate this a lot with the [] syntax, replacing the ?statement variables:

wd:Q12418 p:P186 [ ps:P186 wd:Q296955 ].

wd:Q12418 p:P186 [
            ps:P186 wd:Q291034;
            pq:P518 wd:Q861259
          ].

wd:Q12418 p:P186 [
            ps:P186 wd:Q287;
            pq:P518 wd:Q1737943;
            pq:P580 1951
          ].

Can you use this knowledge to write a query for all paintings with their painting surface?

私の解答はこうです。

SELECT ?painting ?paintingLabel ?material ?materialLabel
WHERE
{
  ?painting wdt:P31/wdt:P279* wd:Q3305213;
            p:P186 [ ps:P186 ?material; pq:P518 wd:Q861259 ].
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}

Try it!

First, we limit ?painting to all instances of painting (Q3305213) or a subclass thereof. Then, we extract the material from the p:P186 statement node, limiting the statements to those that have an applies to part (P518) painting surface (Q861259) qualifier.

ORDER and LIMIT

We return to our regular scheduled program of more SPARQL features.

So far, we’ve only had queries where we were interested in all results. But it’s quite common to care only about a few results: those that are most extreme in some way – oldest, youngest, earliest, latest, highest population, lowest melting point, most children, most materials used, and so on. The common factor here is that the results are ranked in some way, and then we care about the first few results (those with the best rank).

This is controlled by two clauses, which are appended to the WHERE {} block (after the braces, not inside!): ORDER BY and LIMIT.

ORDER BY something sorts the results by something. something can be any expression – for now, the only kind of expression we know are simple variables (?something), but we’ll see some other kinds later. This expression can also be wrapped in either ASC() or DESC() to specify the sorting order (ascending or descending). (If you don’t specify either, the default is ascending sort, so ASC(something) is equivalent to just something.)

LIMIT count cuts off the result list at count results, where count is any natural number. For example, LIMIT 10 limits the query to ten results. LIMIT 1 only returns a single result.

(You can also use LIMIT without ORDER BY. In this case, the results aren’t sorted, so you don’t have any guarantee which results you’ll get. Which is fine if you happen to know that there’s only a certain number of results, or you’re just interested in some result, but don’t care about which one. In either case, adding the LIMIT can significantly speed up the query, since WDQS can stop searching for results as soon as it’s found enough to fill the limit.)

Exercise time! Try to write a query that returns the ten most populous countries. (A country is a sovereign state (Q3624078), and the property for population is P:P1082.) You can start by searching for countries with their population, and then add the ORDER BY and LIMIT clauses.

私の解答はこうです。

SELECT ?country ?countryLabel ?population
WHERE
{
  ?country wdt:P31/wdt:P279* wd:Q3624078;
           wdt:P1082 ?population.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY DESC(?population)
LIMIT 10

Try it!

Note that if we want the most populous countries, we have to order by descending population, so that the first results will be the ones with the highest values.

Exercise

We’ve covered a lot of ground so far – I think it’s time for some exercises. (You can skip this section if you’re in a hurry.)

Arthur Conan Doyle books

Write a query that returns all books by Sir Arthur Conan Doyle.

Chemical elements

Write a query that returns all chemical elements with their element symbol and atomic number, in order of their atomic number.

Rivers that flow into the Mississippi

Write a query that returns all rivers that flow directly into the Mississippi River. (The main challenge is finding the correct property…)

Rivers that flow into the Mississippi II

Write a query that returns all rivers that flow into the Mississippi River, directly or indirectly.

OPTIONAL

In the exercises above, we had a query for all books by Sir Arthur Conan Doyle:

SELECT ?book ?bookLabel
WHERE
{
  ?book wdt:P50 wd:Q35610.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}

Try it!

But that’s a bit boring. There’s so much potential data about books, and we only show the label? Let’s try to craft a query that also includes the title (P1476), illustrator (P110), publisher (P123) and publication date (P577).

A first attempt might look like this:

SELECT ?book ?title ?illustratorLabel ?publisherLabel ?published
WHERE
{
  ?book wdt:P50 wd:Q35610;
        wdt:P1476 ?title;
        wdt:P110 ?illustrator;
        wdt:P123 ?publisher;
        wdt:P577 ?published.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}

Try it!

Run that query. As I’m writing this, it only returns two results – a bit meager! Why is that? We found over a hundred books earlier!

The reason is that to match this query, a potential result (a book) must match all the triples we listed: it must have a title, and an illustrator, and a publisher, and a publication date. If it has some of those properties, but not all of them, it won’t match. And that’s not what we want in this case: we primarily want a list of all the books – if additional data is available, we’d like to include it, but we don’t want that to limit our list of results.

The solution is to tell WDQS that those triples are optional:

SELECT ?book ?title ?illustratorLabel ?publisherLabel ?published
WHERE
{
  ?book wdt:P50 wd:Q35610.
  OPTIONAL { ?book wdt:P1476 ?title. }
  OPTIONAL { ?book wdt:P110 ?illustrator. }
  OPTIONAL { ?book wdt:P123 ?publisher. }
  OPTIONAL { ?book wdt:P577 ?published. }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}

Try it!

This gives us the additional variables (?title, ?publisher etc.) if the appropriate statement exists, but if the statement doesn’t exist, the result isn’t discarded – the variable simply isn’t set.

Note: it’s very important to use separate OPTIONAL clauses here. If you put all the triples into a single clause, like here –

SELECT ?book ?title ?illustratorLabel ?publisherLabel ?published
WHERE
{
  ?book wdt:P50 wd:Q35610.
  OPTIONAL {
    ?book wdt:P1476 ?title;
          wdt:P110 ?illustrator;
          wdt:P123 ?publisher;
          wdt:P577 ?published.
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

Try it!

– you’ll notice that most of the results don’t include any extra information. This is because an optional clause with multiple triples only matches when all those triples can be satisfied. That is: if a book has a title, an illustrator, a publisher, and a publication date, then the optional clause matches, and those values are assigned to the appropriate variables. But if a book has, for example, a title but no illustrator, the entire optional clause doesn’t match, and although the result isn’t discarded, all four variables remain empty.

Expressions, FILTER and BIND

This section might seem a bit less organized than the other ones, because it covers a fairly wide and diverse topic. The basic concept is that we would like to do something with the values that, so far, we’ve just selected and returned indiscriminately. And expressions are the way to express these operations on values. There are many kinds of expressions, and a lot of things you can do with them – but first, let’s start with the basics: data types.

Data types

Each value in SPARQL has a type, which tells you what kind of value it is and what you can do with it. The most important types are:

  • item, like wd:Q42 for Douglas Adams (Q42).
  • boolean, with the two possible values true and false. Boolean values aren’t stored in statements, but many expressions return a boolean value, e.g. 2 < 3 (true) or "a" = "b" (false).
  • string, a piece of text. String literals are written in double quotes.
  • monolingual text, a string with a language tag attached. In a literal, you can add the language tag after the string with an @ sign, e.g. "Douglas Adams"@en.
  • numbers, either integers (1) or decimals (1.23).
  • dates. Date literals can be written by adding ^^xsd:dateTime (case sensitive – ^^xsd:datetime won’t work!) to an ISO 8601 date string: "2012-10-29"^^xsd:dateTime.

演算子

The familiar mathematical operators are available: +, -, *, / to add, subtract, multiply or divide numbers, <, >, =, <=, >= to compare them. The inequality test ≠ is written !=. Comparison is also defined for other types; for example, "abc" < "abd" is true (lexical comparison), as is "2016-01-01"^^xsd:dateTime > "2015-12-31"^^xsd:dateTime and wd:Q4653 != wd:Q283111. And boolean conditions can be combined with && (logical and: a && b is true if both a and b are true) and || (logical or: a || b is true if either (or both) of a and b is true).

FILTER

  情報 For a sometimes faster alternative to FILTER, you might also look at MINUS, see example.

FILTER(condition). is a clause you can insert into your SPARQL query to, well, filter the results. Inside the parentheses, you can put any expression of boolean type, and only those results where the expression returns true are used.

For example, to get a list of all humans born in 2015, we first get all humans with their date of birth –

SELECT ?person ?personLabel ?dob
WHERE
{
  ?person wdt:P31 wd:Q5;
          wdt:P569 ?dob.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". } 
}

– and then filter that to only return the results where the year of the date of birth is 2015. There are two ways to do that: extract the year of the date with the YEAR function, and test that it’s 2015 –

FILTER(YEAR(?dob) = 2015).

– or check that the date is between Jan. 1st (inclusive), 2015 and Jan. 1st, 2016 (exclusive):

FILTER("2015-01-01"^^xsd:dateTime <= ?dob && ?dob < "2016-01-01"^^xsd:dateTime).

I’d say that the first one is more straightforward, but it turns out the second one is much faster, so let’s use that:

SELECT ?person ?personLabel ?dob
WHERE
{
  ?person wdt:P31 wd:Q5;
          wdt:P569 ?dob.
  FILTER("2015-01-01"^^xsd:dateTime <= ?dob && ?dob < "2016-01-01"^^xsd:dateTime).
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". } 
}

Try it!

Another possible use of FILTER is related to labels. The label service is very useful if you just want to display the label of a variable. But if you want to do stuff with the label – for example: check if it starts with “Mr. ” – you’ll find that it doesn’t work:

SELECT ?human ?humanLabel
WHERE
{
  ?human wdt:P31 wd:Q15632617.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
  #This FILTER does not work!
  FILTER(STRSTARTS(?humanLabel, "Mr. ")).
}

Try it!

This query finds all instances of fictional human (Q15632617) and tests if their label starts with "Mr. " (STRSTARTS is short for “string starts [with]”; there’s also STRENDS and CONTAINS). The reason why this doesn’t work is that the label service adds its variables very late during query evaluation; at the point where we try to filter on ?humanLabel, the label service hasn’t created that variable yet.

Fortunately, the label service isn’t the only way to get an item’s label. Labels are also stored as regular triples, using the predicate rdfs:label. Of course, this means all labels, not just English ones; if we only want English labels, we’ll have to filter on the language of the label:

FILTER(LANG(?label) = "en").

The LANG function returns the language of a monolingual string, and here we only select those labels that are in English. The full query is:

SELECT ?human ?label
WHERE
{
  ?human wdt:P31 wd:Q15632617;
         rdfs:label ?label.
  FILTER(LANG(?label) = "[AUTO_LANGUAGE]").
  FILTER(STRSTARTS(?label, "Mr. ")).
}

Try it!

We get the label with the ?human rdfs:label ?label triple, restrict it to English labels, and then check if it starts with “Mr. ”.

One can also use FILTER with a regular expression. In the following example

SELECT ?item ?itemLabel ?bblid
WHERE {  
    ?item wdt:P2580 ?bblid .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }  
    FILTER(!REGEX(STR(?bblid), "[\\.q]")) 
}

Try it!

If the format constraint for an ID is [A-Za-z][-.0-9A-Za-z]{1,}:

SELECT ?item ?itemLabel ?bblid
WHERE {  
    ?item wdt:P2580 ?bblid .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }  
    FILTER(!REGEX(STR(?bblid), "^[A-Za-z][-.0-9A-Za-z]{1,}$"))
}

Try it!

It is possible to filter out specific elements like this

FILTER ( ?item not in ( wd:Q4115189,wd:Q13406268,wd:Q15397819 ) )

It is possible to filter and have elements that aren't filled:

FILTER ( NOT EXISTS { ?item  wdt:P21 [] } )

BIND, BOUND, IF

These three features are often used in conjunction, so I’ll first explain all three of them and then show you some examples.

A BIND(expression AS ?variable). clause can be used to assign the result of an expression to a variable (usually a new variable, but you can also overwrite existing ones).

BOUND(?variable) tests if a variable has been bound to a value (returns true or false). It’s mostly useful on variables that are introduced in an OPTIONAL clause.

IF(condition,thenExpression,elseExpression) evaluates to thenExpression if condition evaluates to true, and to elseExpression if condition evaluates to false. That is, IF(true, "yes", "no") evaluates to "yes", and IF(false, "great", "terrible") evaluates to "terrible".

BIND can be used to bind the results of some calculation to a new variable. This can be an intermediate result of a larger calculation or just directly a result of the query. For example, to get the age of victims of capital punishment:

SELECT ?person ?personLabel ?age
WHERE
{
  ?person wdt:P31 wd:Q5;
          wdt:P569 ?born;
          wdt:P570 ?died;
          wdt:P1196 wd:Q8454.
  BIND(?died - ?born AS ?ageInDays).
  BIND(?ageInDays/365.2425 AS ?ageInYears).
  BIND(FLOOR(?ageInYears) AS ?age).
  # or, as one expression:
  #BIND(FLOOR((?died - ?born)/365.2425) AS ?age).
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}

Try it!

BIND can also be used to simply bind constant values to variables in order to increase readability. For example, a query that finds all female priests:

SELECT ?woman ?womanLabel
WHERE
{
  ?woman wdt:P31 wd:Q5;
         wdt:P21 wd:Q6581072;
         wdt:P106 wd:Q42603.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}

Try it!

can be rewritten like this:

SELECT ?woman ?womanLabel
WHERE
{
  BIND(wdt:P31 AS ?instanceOf).
  BIND(wd:Q5 AS ?human).
  BIND(wdt:P21 AS ?sexOrGender).
  BIND(wd:Q6581072 AS ?female).
  BIND(wdt:P106 AS ?occupation).
  BIND(wd:Q42603 AS ?priest).
  ?woman ?instanceOf ?human;
         ?sexOrGender ?female;
         ?occupation ?priest.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}

Try it!

The meaningful part of the query, from ?woman to ?priest., is now probably more readable. However, the large BIND block right in front of it is pretty distracting, so this technique should be used sparingly. (In the WDQS user interface, you can also hover your mouse over any term like wd:Q123 or wdt:P123 and see the label and description for the entity, so ?female is only more readable than wd:Q6581072 if you ignore that feature.)

IF expressions are often used with condition-expressions built with BOUND. For example, suppose you have a query that shows some humans, and instead of just showing their label, you’d like to display their pseudonym (P742) if they have one, and only use the label if a pseudonym doesn’t exist. For this, you select the pseudonym in an OPTIONAL clause (it has to be optional – you don’t want to throw out results that don’t have a pseudonym), and then use BIND(IF(BOUND(… to select either the pseudonym or the label.

SELECT ?writer ?label
WHERE
{
  # French writer born in the second half of the 18th century
  ?writer wdt:P31 wd:Q5;
          wdt:P27 wd:Q142;
          wdt:P106 wd:Q36180;
          wdt:P569 ?dob.
  FILTER("1751-01-01"^^xsd:dateTime <= ?dob && ?dob < "1801-01-01"^^xsd:dateTime).
  # get the English label
  ?writer rdfs:label ?writerLabel.
  FILTER(LANG(?writerLabel) = "en").
  # get the pseudonym, if it exists
  OPTIONAL { ?writer wdt:P742 ?pseudonym. }
  # bind the pseudonym, or if it doesn’t exist the English label, as ?label
  BIND(IF(BOUND(?pseudonym),?pseudonym,?writerLabel) AS ?label).
}

Try it!

Other properties that may be used in this way include nickname (P1449), posthumous name (P1786), and taxon common name (P1843) – anything where some sort of “fallback” makes sense.

You can also combine BOUND with FILTER to ensure that at least one of several OPTIONAL blocks has been fulfilled. For example, let’s get all astronauts that went to the moon, as well as the members of Apollo 13 (Q182252) (close enough, right?). That restriction can’t be expressed as a single property path, so we need one OPTIONAL clause for “member of some moon mission” and another one for “member of Apollo 13”. But we only want to select those results where at least one of those conditions is true.

SELECT ?astronaut ?astronautLabel
WHERE
{
  ?astronaut wdt:P31 wd:Q5;
             wdt:P106 wd:Q11631.
  OPTIONAL {
    ?astronaut wdt:P450 ?mission.
    ?mission wdt:P31 wd:Q495307.
  }
  OPTIONAL {
    ?astronaut wdt:P450 wd:Q182252.
    BIND(wd:Q182252 AS ?mission).
  }
  FILTER(BOUND(?mission)).
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}

Try it!

COALESCE

The COALESCE function can be used as an abbreviation of the BIND(IF(BOUND(?x), ?x, ?y) AS ?z). pattern for fallbacks mentioned above: it takes a number of expressions and returns the first one that evaluates without error. For example, the above “pseudonym” fallback

BIND(IF(BOUND(?pseudonym),?pseudonym,?writerLabel) AS ?label).

can be written more concisely as

BIND(COALESCE(?pseudonym, ?writerLabel) AS ?label).

and it’s also easy to add another fallback label in case the ?writerLabel isn’t defined either:

BIND(COALESCE(?pseudonym, ?writerLabel, "<no label>") AS ?label).

Grouping

So far, all the queries we’ve seen were queries that found all items satisfying some conditions; in some cases, we also included extra statements on the item (paintings with materials, Arthur Conan Doyle books with title and illustrator).

But it’s very common that we don’t want a long list of all results. Instead, we might ask questions like this:

  • How many paintings were painted on canvas / poplar wood / etc.?
  • What is the highest population of each country’s cities?
  • What is the total number of guns produced by each manufacturer?
  • Who publishes, on average, the longest books?

City populations

Let’s look at the second question for now. It’s fairly simple to write a query that lists all cities along with their population and country, ordered by country:

SELECT ?country ?city ?population
WHERE
{
  ?city wdt:P31/wdt:P279* wd:Q515;
        wdt:P17 ?country;
        wdt:P1082 ?population.
}
ORDER BY ?country

Try it!

(Note: that query returns a lot of results, which might cause trouble for your browser. You might want to add a LIMIT clause.)

Since we’re ordering the results by country, all cities belonging to a country form one contiguous block in the results. To find the highest population within that block, we want to consider the block as a group, and aggregate all the individual population values into one value: the maximum. This is done with a GROUP BY clause below the WHERE block, and an aggregate function (MAX) in the SELECT clause.

SELECT ?country (MAX(?population) AS ?maxPopulation)
WHERE
{
  ?city wdt:P31/wdt:P279* wd:Q515;
        wdt:P17 ?country;
        wdt:P1082 ?population.
}
GROUP BY ?country

Try it!

We’ve replaced the ORDER BY with a GROUP BY. The effect of this is that all results with the same ?country are now grouped together into a single result. This means that we have to change the SELECT clause as well. If we kept the old clause SELECT ?country ?city ?population, which ?city and ?population would be returned? Remember, there are many results in this one result; they all have the same ?country, so we can select that, but since they can all have a different ?city and ?population, we have to tell WDQS which of those values to select. That’s the job of the aggregate function. In this case, we’ve used MAX: out of all the ?population values, we select the maximum one for the group result. (We also have to give that value a new name with the AS construct, but that’s just a minor detail.)

This is the general pattern for writing group queries: write a normal query that returns the data you want (not grouped, with many results per “group”), then add a GROUP BY clause and add an aggregate function all the non-grouped variables in the SELECT clause.

Painting materials

Let’s try it out with another question: How many paintings were painted on each material? First, write a query that just returns all paintings along with their painting material. (Take care to only use those material used (P186) statements with an applies to part (P518) painting surface (Q861259) qualifier.)

SELECT ?material ?painting
WHERE
{
  ?painting wdt:P31/wdt:P279* wd:Q3305213;
            p:P186 [ ps:P186 ?material; pq:P518 wd:Q861259 ].
}

Try it!

Next, add a GROUP BY clause on the ?material, and then an aggregate function on the other selected variable (?painting). In this case, we are interested in the number of paintings; the aggregate function for that is COUNT.

SELECT ?material (COUNT(?painting) AS ?count)
WHERE
{
  ?painting wdt:P31/wdt:P279* wd:Q3305213;
            p:P186 [ ps:P186 ?material; pq:P518 wd:Q861259 ].
}
GROUP BY ?material

Try it!

One problem with this is that we don’t have the label for the materials, so the results are a bit inconvenient to interpret. If we just add the label variable, we’ll get an error:

SELECT ?material ?materialLabel (COUNT(?painting) AS ?count)
WHERE
{
  ?painting wdt:P31/wdt:P279* wd:Q3305213;
            p:P186 [ ps:P186 ?material; pq:P518 wd:Q861259 ].
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
GROUP BY ?material

Try it!

Bad aggregate

“Bad aggregate” is an error message you’ll probably see a lot when working with group queries; it means that one of the selected variables needs an aggregate function but doesn’t have one, or it has an aggregate function but isn’t supposed to have one. In this case, WDQS thinks that there might be multiple ?materialLabels per ?material (even though we know that can’t happen), and so it complains that you’re not specifying an aggregate function for that variable.

One solution is to group over multiple variables. If you list multiple variables in the GROUP BY clause, there’s one result for each combination of those variables, and you can select all those variables without aggregate function. In this case, we’ll group over both ?material and ?materialLabel.

SELECT ?material ?materialLabel (COUNT(?painting) AS ?count)
WHERE
{
  ?painting wdt:P31/wdt:P279* wd:Q3305213;
            p:P186 [ ps:P186 ?material; pq:P518 wd:Q861259 ].
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
GROUP BY ?material ?materialLabel

Try it!

We’re almost done with the query – just one more improvement: we’d like to see the most-used materials first. Fortunately, we’re allowed to use the new, aggregated variables from the SELECT clause (here, ?count) in an ORDER BY clause, so this is very simple:

SELECT ?material ?materialLabel (COUNT(?painting) AS ?count)
WHERE
{
  ?painting wdt:P31/wdt:P279* wd:Q3305213;
            p:P186 [ ps:P186 ?material; pq:P518 wd:Q861259 ].
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
GROUP BY ?material ?materialLabel
ORDER BY DESC(?count)

Try it!

As an exercise, let’s do the other queries too.

Guns by manufacturer

What is the total number of guns produced by each manufacturer?

Publishers by number of pages

What is the average (function: AVG) number of pages of books by each publisher?

HAVING

A small addendum to that last query – if you look at the results, you might notice that the top result has an outrageously large average, over ten times that of the second place. A bit of investigation reveals that this is because that publisher (UTET (Q4002388)) only published a single book with a number of pages (P1104) statement, Grande dizionario della lingua italiana (Q3775610), which skews the results a bit. To remove outliers like that, we could try to select only publishers that published at least two books with number of pages (P1104) statements on Wikidata.

How do we do that? Normally, we restrict results with a FILTER clause, but in this case we want to restrict based on the group (the number of books), not any individual result. This is done with a HAVING clause, which can be placed right after a GROUP BY clause and takes an expression just like FILTER does:

SELECT ?publisher ?publisherLabel (AVG(?pages) AS ?avgPages)
WHERE
{
  ?book wdt:P123 ?publisher;
        wdt:P1104 ?pages.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
GROUP BY ?publisher ?publisherLabel
HAVING(COUNT(?book) > 1)
ORDER BY DESC(?avgPages)

Try it!

Aggregate functions summary

Here’s a short summary of the available aggregate functions:

  • COUNT: the number of elements. You can also write COUNT(*) to simply count all results.
  • SUM, AVG: the sum or average of all elements, respectively. If the elements aren’t numbers, you’ll get weird results.
  • MIN, MAX: the minimum or maximum value of all elements, respectively. This works for all value types; numbers are sorted numerically, strings and other types lexically.
  • SAMPLE: any element. This is occasionally useful if you know there’s only one result, or if you don’t care which one is returned.
  • GROUP_CONCAT: concatenates all elements. Useful for example if you want only one result for an item but you want to include informations for a property that may have several statements for this item, such as the occupations of a person. The different occupations may be regrouped and concatenated to appear all in only one variable instead of several lines in the results. If you’re curious, you can look it up in the SPARQL specification.

Additionally, you can add a DISTINCT modifier for any of these functions to eliminate duplicate results. For example, if there are two results but they both have the same value in ?var, then COUNT(?var) will return 2 but COUNT(DISTINCT ?var) will only return 1. You often have to use DISTINCT when your query can return the same item multiple times – this can happen if, for example, you use ?item wdt:P31/wdt:P279* ?class, and there are multiple paths from ?item to ?class: you will get a new result for each of those paths, even though all the values in the result are identical. (If you’re not grouping, you can also eliminate those duplicate results by starting the query with SELECT DISTINCT instead of just SELECT.)

wikibase:Label and aggregations bug

There is currently (February 2020) a problem with the query service when you want to use the wikibase:label service with aggregation functions. A query such as the following, which searches all academic persons with more than two countries of citizenships in Wikidata and is supposed to show the names of those countries in an aggregate string:

select ?person ?personLabel (group_concat(?citizenshipLabel;separator="/") as ?citizenships) {
  # find all academics
  ?person wdt:P106 wd:Q3400985 ;   
          wdt:P27  ?citizenship .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} group by ?person ?personLabel having (count(?citizenship) > 2)

Try it!

fails to show anything in the ?citizenships column. A workaround is to explicitely name the ?personLabel and ?citizenshipLabel in the wikibase:label service call like this:

  SERVICE wikibase:label { 
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". 
    ?citizenship rdfs:label ?citizenshipLabel .
    ?person      rdfs:label ?personLabel .
  }

The following query works as expected:

select ?person ?personLabel (group_concat(?citizenshipLabel;separator="/") as ?citizenships) {
  ?person wdt:P106 wd:Q3400985 ;
          wdt:P27  ?citizenship .
  SERVICE wikibase:label { 
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". 
    ?citizenship rdfs:label ?citizenshipLabel .
    ?person      rdfs:label ?personLabel .
  }
} group by ?person ?personLabel having (count(?citizenship) > 2)

Try it!

VALUES

One can select items based on a list of items:

SELECT ?item ?itemLabel ?mother ?motherLabel WHERE {
  VALUES ?item { wd:Q937 wd:Q1339 }
  OPTIONAL { ?item wdt:P25 ?mother. }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

Try it!

One can also select based on a list of values of a specific property:

SELECT ?item ?itemLabel ?mother ?motherLabel ?ISNI WHERE {
  VALUES ?ISNI { "0000 0001 2281 955X" "0000 0001 2276 4157" }
  ?item wdt:P213 ?ISNI.
  OPTIONAL { ?item wdt:P25 ?mother. }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

Try it!

VALUES can also do more and build enumerations of values possible for a couple (or a tuple) of variables. For example say you want to use (known) custom labels for the persons enumerated in the first « value » example. It’s then possible to use a « values » clause such as VALUES (?item ?customItemLabel) { (wd:Q937 "Einstein") (wd:Q1339 "Bach") } which ensures that whenever ?item has value wd:Q937 in a result, ?customItemLabel own value is Einstein and whenever ?item has value wd:Q1339, ?customItemLabel’s value is Bach.

SELECT ?item ?customItemLabel ?mother ?motherLabel WHERE {
  VALUES (?item ?customItemLabel) { (wd:Q937 "Einstein") (wd:Q1339 "Bach") }
  OPTIONAL { ?item wdt:P25 ?mother. }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

Try it!

And beyond…

This guide ends here. SPARQL doesn’t: there’s still a lot that I haven’t shown you – I never promised this was going to be a complete guide! If you got this far, you already know a lot about WDQS and should be able to write some very powerful queries. But if you want to learn even more, here are some things you can look at:

  • Subqueries. You add another entire query in curly brackets ({ SELECT ... WHERE { ... } LIMIT 10 }), and the results are visible in the outer query. (If you’re familiar with SQL, you’ll have to rethink the concept a bit – SPARQL subqueries are purely “bottom-up” and can’t use values from the outer query, like SQL “correlated subqueries” can.)
  • MINUS lets you select results that don’t fit some graph pattern. FILTER NOT EXISTS is mostly equivalent (see the SPARQL spec for an example where they differ), but – at least on WDQS – usually slower by quite a bit.

Your main reference for these and other topics is the SPARQL specification.

Also, you can take a look at SPARQL tutorial on Wikibooks and this tutorial by data.world.

And of course, there are some parts of Wikidata still missing as well, such as references, numeric precision (100±2.5), values with units (two kilograms), geocoordinates, sitelinks, statements on properties, and more. You can see how those are modeled as triples under mw:Wikibase/Indexing/RDF Dump Format.

See also