Wikidata:Lexicographical data/Documentation/Query Service

This page or section is in the process of an expansion or major restructuring. You are welcome to assist in its construction by editing it as well.

This page is a guide to using the Wikidata Query Service to retrieve information about lexemes. It presupposes familiarity with the gentle introduction to the Query Service and uses some of the facilities described in the Query Service user manual, but it does not presuppose familiarity with the concepts described in the illustrative queries page.

You can learn more about the structure of lexemes and their forms and senses on the main documentation page.

Querying lexemes

No matter how complete or detailed an individual Wikidata lexeme may be, it always contains three pieces of information: its language, its lemma(ta), and its lexical category.

Querying lexemes by language

Each Wikidata lexeme has a single language, represented as a Wikidata item, associated with it (such as English (Q1860), Bangla (Q9610), or Hindustani (Q11051)). It is possible to retrieve a list of lexemes with lexeme language English (Q1860) using the dct:language[1] predicate as follows:

?lexeme dct:language ?language

#title:Lexemes in English
SELECT ?lexeme {
    ?lexeme dct:language wd:Q1860 .
}

For an arbitrary set of lexemes, the languages can also be obtained separately, and the labels of those languages may be displayed using the wikibase:label[2] service:

SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }

#title:Ten arbitrary lexemes and the names of their lexeme languages
SELECT ?lexeme ?language ?languageLabel {
    ?lexeme dct:language ?language .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
LIMIT 10

Note that the wikibase:label service only provides item labels, item descriptions, and item aliases automatically; it does not automatically display any lexeme lemmata, sense glosses, or form representations. For obtaining those, see the subsections "Querying lexeme lemmata", "Filtering definitions", and "Filtering form representations" respectively.

Querying lexeme lemmata

Read more about lemmata at Wikidata:Lexicographical data/Documentation/Lemmata

The actual lemmata of a lexeme may be obtained using the wikibase:lemma[3] predicate:

?lexeme wikibase:lemma ?lemma

#title:Lexemes in English and their lemmata
SELECT ?lexeme ?lemma {
    ?lexeme dct:language wd:Q1860 .
    ?lexeme wikibase:lemma ?lemma .
}

Each lexeme lemma, like each item label, item description, and item alias, consists of a string of text attached to a language code. When there are multiple lemmata on a lexeme (to handle multiple writing systems or other orthographic variation within a language), you can filter the language code of the lemma using a string comparison with LANG():

FILTER(LANG(?lemma)="ur")

#title:Lexemes in Hindustani and their "hi" lemmata
SELECT ?lexeme ?lemma {
    ?lexeme dct:language wd:Q11051 .
    ?lexeme wikibase:lemma ?lemma .
    FILTER(LANG(?lemma)="hi")
}
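
An exact comparison with LANG() matches only one language code. When a language uses several related codes (for example "en", "en-gb", and "en-us"), the standard SPARQL function LANGMATCHES() can match the whole range instead. The following is a minimal sketch, assuming that the English lexemes of interest carry lemmata under such variant codes:

```sparql
#title:Lexemes in English and their lemmata under any "en" language code
SELECT ?lexeme ?lemma {
    ?lexeme dct:language wd:Q1860 .
    ?lexeme wikibase:lemma ?lemma .
    # Range match: "en" also matches subtags such as "en-gb" or "en-us"
    FILTER(LANGMATCHES(LANG(?lemma), "en"))
}
```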

You can also perform many string operations—STRSTARTS(), CONTAINS(), REGEX(), and so on—on the lemmata themselves:

FILTER(STRENDS(?lemma,"päev"))

#title:Lexemes in Estonian and their lemmata beginning with "aja"
SELECT ?lexeme ?lemma {
    ?lexeme dct:language wd:Q9072 .
    ?lexeme wikibase:lemma ?lemma .
    FILTER(STRSTARTS(?lemma,"aja"))
}

Some operations involving lexicographical ordering—"<", ">=", and so on—require that the lemmata be wrapped in STR() before applying the operation:

FILTER(STR(?lemma) >= "terv")

#title:Lexemes in Estonian with lemmata that appear before "aja" in alphabetical order
SELECT ?lexeme ?lemma {
    ?lexeme dct:language wd:Q9072 .
    ?lexeme wikibase:lemma ?lemma .
    FILTER(STR(?lemma) < "aja")
}
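
The same STR() wrapping can be used to select a range of lemmata and to sort the results. The following is a sketch only; the bounds "aja" and "ajb" are illustrative, and the exact ordering of non-ASCII characters depends on how the query service compares strings:

```sparql
#title:Lexemes in Estonian with lemmata between "aja" and "ajb", sorted by lemma
SELECT ?lexeme ?lemma {
    ?lexeme dct:language wd:Q9072 .
    ?lexeme wikibase:lemma ?lemma .
    # Both comparisons operate on the plain string, without the language tag
    FILTER(STR(?lemma) >= "aja" && STR(?lemma) < "ajb")
}
ORDER BY STR(?lemma)
```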

Filtering lexical categories

In addition to being associated with a single language, each Wikidata lexeme is associated with a lexical category (such as noun (Q1084), verb (Q24905), adjective (Q34698), and so on). This may be obtained using the wikibase:lexicalCategory[4] predicate:

?lexeme wikibase:lexicalCategory ?category

#title:Lexemes in English that are nouns
SELECT ?lexeme {
    ?lexeme dct:language wd:Q1860 .
    ?lexeme wikibase:lexicalCategory wd:Q1084 .
}

For an arbitrary set of lexemes, the lexical categories can be obtained, and the labels of those lexical categories may be displayed using the wikibase:label service:

#title:Lexemes in English and the names of their lexical categories
SELECT ?lexeme ?category ?categoryLabel {
    ?lexeme dct:language wd:Q1860 .
    ?lexeme wikibase:lexicalCategory ?category .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
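
Because the lexical category is an ordinary item, it can also be used as a grouping key. As a hedged sketch, the query above can be turned into a count of English lexemes per lexical category with the standard COUNT() aggregate; the variable name ?count is chosen here for illustration:

```sparql
#title:Number of lexemes in English per lexical category
SELECT ?category ?categoryLabel (COUNT(?lexeme) AS ?count) {
    ?lexeme dct:language wd:Q1860 .
    ?lexeme wikibase:lexicalCategory ?category .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
# Every non-aggregated variable in SELECT must appear in GROUP BY
GROUP BY ?category ?categoryLabel
ORDER BY DESC(?count)
```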

Querying senses

Senses of lexemes can be obtained using the ontolex:sense[5] predicate:

?lexeme ontolex:sense ?sense

#title:Senses of noun lexemes in English
SELECT ?lexeme ?sense {
    ?lexeme dct:language wd:Q1860 .
    ?lexeme wikibase:lexicalCategory wd:Q1084 .
    ?lexeme ontolex:sense ?sense .
}

Filtering definitions

The individual free-text glosses on a lexeme sense may be obtained using the skos:definition predicate:

?sense skos:definition ?gloss

#title:Glosses from senses on Turkish noun lexemes
SELECT ?lexeme ?sense ?gloss {
    ?lexeme dct:language wd:Q256 .
    ?lexeme wikibase:lexicalCategory wd:Q1084 .
    ?lexeme ontolex:sense ?sense .
    ?sense skos:definition ?gloss .
}

Sense glosses, like lexeme lemmata, consist of a string of text attached to a language code. When there are multiple glosses on a sense (to aid people speaking different languages visiting the lexeme page), you can filter the language code of the glosses using a string comparison with LANG():

FILTER(LANG(?gloss)="fr")

#title:Glosses (in English) from senses on Turkish noun lexemes
SELECT ?lexeme ?sense ?gloss {
    ?lexeme dct:language wd:Q256 .
    ?lexeme wikibase:lexicalCategory wd:Q1084 .
    ?lexeme ontolex:sense ?sense .
    ?sense skos:definition ?gloss .
    FILTER(LANG(?gloss)="en")
}

Because a sense is not required to have a gloss in every language, and the gloss language may differ from the language of the lexeme, the query above omits every sense that lacks a gloss in the requested language. To list those senses as well, that is, to not require that each sense in the result have a gloss in that language, the lines that obtain a gloss and filter it may be wrapped in an OPTIONAL block:

OPTIONAL { ?sense skos:definition ?gloss . FILTER(LANG(?gloss)="de") }

#title:Senses on Turkish noun lexemes, showing their English glosses if they exist and leaving the gloss unbound otherwise
SELECT ?lexeme ?sense ?gloss {
    ?lexeme dct:language wd:Q256 .
    ?lexeme wikibase:lexicalCategory wd:Q1084 .
    ?lexeme ontolex:sense ?sense .
    OPTIONAL { ?sense skos:definition ?gloss .
               FILTER(LANG(?gloss)="en") }
}
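
The opposite selection, listing only the senses that lack a gloss in a given language, can be sketched with the standard FILTER NOT EXISTS construct. This is an illustrative variant of the query above rather than a query taken from the original page:

```sparql
#title:Senses on Turkish noun lexemes that have no English gloss
SELECT ?lexeme ?sense {
    ?lexeme dct:language wd:Q256 .
    ?lexeme wikibase:lexicalCategory wd:Q1084 .
    ?lexeme ontolex:sense ?sense .
    # Keep the sense only if no English gloss can be found for it
    FILTER NOT EXISTS {
        ?sense skos:definition ?gloss .
        FILTER(LANG(?gloss)="en")
    }
}
```

A list like this may be useful for finding senses that still need a gloss in a particular language.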

The same considerations for lexeme lemmata regarding string operations and lexicographical ordering operations also apply to sense glosses:

#title:Glosses (in English) containing "time" from senses on Turkish noun lexemes
SELECT ?lexeme ?sense ?gloss {
    ?lexeme dct:language wd:Q256 .
    ?lexeme wikibase:lexicalCategory wd:Q1084 .
    ?lexeme ontolex:sense ?sense .
    ?sense skos:definition ?gloss .
    FILTER(LANG(?gloss)="en")
    FILTER(CONTAINS(?gloss,"time"))
}

Getting forms

Forms of lexemes can be obtained using the ontolex:lexicalForm[6] predicate:

?lexeme ontolex:lexicalForm ?form

#title:Forms of Turkish noun lexemes
SELECT ?lexeme ?form {
    ?lexeme dct:language wd:Q256 .
    ?lexeme wikibase:lexicalCategory wd:Q1084 .
    ?lexeme ontolex:lexicalForm ?form .
}

Filtering form representations

The individual representations of a lexeme form may be obtained using the ontolex:representation[7] predicate:

?form ontolex:representation ?representation

#title:Representations from forms on Turkish noun lexemes
SELECT ?lexeme ?form ?representation {
    ?lexeme dct:language wd:Q256 .
    ?lexeme wikibase:lexicalCategory wd:Q1084 .
    ?lexeme ontolex:lexicalForm ?form .
    ?form ontolex:representation ?representation .
}

Form representations, like lexeme lemmata and sense glosses, consist of a string of text attached to a language code. When there are multiple representations on a form (to handle multiple writing systems or other orthographic variation within a language), you can filter the language code of the representations using a string comparison with LANG():

FILTER(LANG(?representation)="azb")

#title:Representations (in the Ottoman script) from forms on Turkish noun lexemes
SELECT ?lexeme ?form ?representation {
    ?lexeme dct:language wd:Q256 .
    ?lexeme wikibase:lexicalCategory wd:Q1084 .
    ?lexeme ontolex:lexicalForm ?form .
    ?form ontolex:representation ?representation .
    FILTER(LANG(?representation)="ota")
}

The same considerations for lexeme lemmata and sense glosses regarding string operations and lexicographical ordering operations also apply to form representations:

#title:Representations (in the Latin script) from forms on Turkish noun lexemes that precede "et" alphabetically
SELECT ?lexeme ?form ?representation {
    ?lexeme dct:language wd:Q256 .
    ?lexeme wikibase:lexicalCategory wd:Q1084 .
    ?lexeme ontolex:lexicalForm ?form .
    ?form ontolex:representation ?representation .
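    # Assumes "tr" is the language code used for Latin-script representations
    FILTER(LANG(?representation)="tr")
    FILTER(STR(?representation) < "et")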
}

Filtering grammatical features

The individual grammatical features of a lexeme form may be obtained using the wikibase:grammaticalFeature[8] predicate:

?form wikibase:grammaticalFeature ?feature

#title:Grammatical features of forms on Turkish noun lexemes
SELECT ?lexeme ?form ?feature ?featureLabel {
    ?lexeme dct:language wd:Q256 .
    ?lexeme wikibase:lexicalCategory wd:Q1084 .
    ?lexeme ontolex:lexicalForm ?form .
    ?form wikibase:grammaticalFeature ?feature .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

The result of this query lists one feature per row, so a form with several features appears in several rows. It is possible to display the names of all grammatical features of a form together in one row, using the GROUP_CONCAT() aggregate and calling the wikibase:label service in manual mode, where the label variable is bound explicitly with rdfs:label:

SELECT ... (GROUP_CONCAT(DISTINCT ?featureLabel; SEPARATOR=" / ") AS ?features) ...

#title:Grammatical features of forms on Turkish noun lexemes, where all features are displayed together
SELECT ?lexeme ?form (GROUP_CONCAT(DISTINCT ?featureLabel; SEPARATOR=" / ") AS ?features) {
    ?lexeme dct:language wd:Q256 .
    ?lexeme wikibase:lexicalCategory wd:Q1084 .
    ?lexeme ontolex:lexicalForm ?form .
    ?form wikibase:grammaticalFeature ?feature .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". ?feature rdfs:label ?featureLabel }
} GROUP BY ?lexeme ?form

To search for forms that have multiple grammatical features, the wikibase:grammaticalFeature predicate may be repeated for each feature on the form:

#title:Forms on German noun lexemes that are dative and plural
SELECT ?lexeme ?form {
    ?lexeme dct:language wd:Q188 .
    ?lexeme wikibase:lexicalCategory wd:Q1084 .
    ?lexeme ontolex:lexicalForm ?form .
    ?form wikibase:grammaticalFeature wd:Q145599 .
    ?form wikibase:grammaticalFeature wd:Q146786 .
}

Alternatively, the features may be given as a comma-separated list after a single wikibase:grammaticalFeature predicate, which is equivalent to repeating the predicate for each feature:

?form wikibase:grammaticalFeature wd:Q21714344, wd:Q110786, wd:Q192613

#title:Forms on Swedish noun lexemes that are genitive, definite, and singular
SELECT ?lexeme ?form {
    ?lexeme dct:language wd:Q9027 .
    ?lexeme wikibase:lexicalCategory wd:Q1084 .
    ?lexeme ontolex:lexicalForm ?form .
    ?form wikibase:grammaticalFeature wd:Q146233, wd:Q53997851, wd:Q110786 .
}
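
Feature filters combine naturally with the lemma and representation predicates described earlier. As a sketch, the German query above can be extended to show the lemma of each lexeme next to the representation of its dative plural form:

```sparql
#title:Dative plural forms of German nouns, with lemma and representation
SELECT ?lexeme ?lemma ?form ?representation {
    ?lexeme dct:language wd:Q188 .
    ?lexeme wikibase:lexicalCategory wd:Q1084 .
    ?lexeme wikibase:lemma ?lemma .
    ?lexeme ontolex:lexicalForm ?form .
    # Dative (Q145599) and plural (Q146786), written as a comma-separated list
    ?form wikibase:grammaticalFeature wd:Q145599, wd:Q146786 .
    ?form ontolex:representation ?representation .
}
```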

Reading statements

These queries demonstrate use of the most commonly added statements (on lexemes, forms, and senses) to discover new information within lexicographical data.

The logic surrounding the querying of statements and their parts on items and properties applies equally to the querying of lexeme, form, and sense statements.
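
As an illustration of that point, the following sketch reads a sense statement through its full statement node, using the same p: and ps: prefixes that apply to item statements, together with wikibase:rank. It uses the item for this sense (P5137) property described in the next subsection; the variable names are chosen here for illustration:

```sparql
#title:Statement nodes and ranks for "item for this sense" on senses of Turkish lexemes
SELECT ?sense ?item ?rank {
    ?lexeme dct:language wd:Q256 .
    ?lexeme ontolex:sense ?sense .
    # p: leads from the sense to the statement node
    ?sense p:P5137 ?statement .
    # ps: leads from the statement node to the main value
    ?statement ps:P5137 ?item .
    ?statement wikibase:rank ?rank .
}
```

Qualifiers and references on the statement node can be reached in the same way as for item statements.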

Getting indirect relations between senses: "item for this sense" and "predicate for"

The most frequently occurring statements on senses pertain to their relationship to corresponding Wikidata items for the concepts they represent.

In particular, the item for this sense (P5137) property links a sense that expresses a substantive concept, such as the object represented by a noun or the quality or characteristic represented by an adjective, to the Wikidata item for that object, quality, or characteristic:

?sense wdt:P5137 ?item

#title:Values of "item for this sense" on Turkish nouns
SELECT ?item ?itemLabel {
    ?lexeme dct:language wd:Q256 .
    ?lexeme wikibase:lexicalCategory wd:Q1084 .
    # "item for this sense" is a statement on senses, so go through the sense
    ?lexeme ontolex:sense ?sense .
    ?sense wdt:P5137 ?item .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

Getting direct relations between senses: "synonym" and "translation"

Sometimes no corresponding Wikidata item can be established for a lexeme sense, but correspondences between lexeme senses with the same meaning can still be established.

When the senses are on lexemes in the same language, the synonym (P5973) property can be used to link them:

?sense wdt:P5973 ?synonym

#title:Synonyms of Nynorsk "dørsprekk" (meaning 'door crack') and their glosses
SELECT ?synonym ?gloss {
    wd:L1315743-S1 wdt:P5973 ?synonym .
    ?synonym skos:definition ?gloss .
}

When the senses are on lexemes in different languages, the translation (P5972) property can be used to link them instead:

?sense wdt:P5972 ?translation

#title:Translations of Hindustani "hain" (meaning 'to exist') and their glosses
SELECT ?translation ?gloss {
    wd:L993718-S1 wdt:P5972 ?translation .
    ?translation skos:definition ?gloss .
}
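
A translation statement points to a sense, not to a lexeme. To see which word each translation belongs to, the sense can be traced back to its lexeme by using ontolex:sense in the reverse direction. The following is a sketch of that idea, reusing the sense from the query above:

```sparql
#title:Translations of Hindustani "hain", with the lemma and language of each translated lexeme
SELECT ?translation ?lexeme ?lemma ?languageLabel {
    wd:L993718-S1 wdt:P5972 ?translation .
    # Find the lexeme that the translated sense belongs to
    ?lexeme ontolex:sense ?translation .
    ?lexeme wikibase:lemma ?lemma .
    ?lexeme dct:language ?language .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
```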

Getting derivation information: "derived from lexeme" and "mode of derivation"

Getting parts of larger lexemes: "combines lexemes"

Getting pronunciation information: "pronunciation audio" and "IPA transcription"

References

  1. http://purl.org/dc/terms/language, from the Dublin Core Metadata Initiative
  2. http://wikiba.se/ontology#label, as a Wikibase-specific service
  3. http://wikiba.se/ontology#lemma, as a Wikibase-specific predicate
  4. http://wikiba.se/ontology#lexicalCategory, as a Wikibase-specific predicate
  5. http://www.w3.org/ns/lemon/ontolex#sense, from the Lemon Core module
  6. http://www.w3.org/ns/lemon/ontolex#lexicalForm, from the Lemon Core module
  7. http://www.w3.org/ns/lemon/ontolex#representation, from the Lemon Core module
  8. http://wikiba.se/ontology#grammaticalFeature, as a Wikibase-specific predicate