Wikidata:Lexicographical data/Documentation
This documentation page is currently being reworked. Some important changes may occur.
This is the main documentation page for lexicographical data on Wikidata.
See also the technical documentation on extension WikibaseLexeme.
Introduction
Data Model
The data model of WikibaseLexeme describes the structure of the data that is handled as "Lexemes" in Wikibase. The text below is a summary; for more detailed information, see Extension:WikibaseLexeme/Data Model.
A Lexeme is a lexical element of a language, such as a word, a phrase, or a prefix (see Lexeme on Wikipedia). Lexemes are Entities in the sense of the Wikibase data model.
At a high level, the Lexeme hierarchy is modeled as follows:
- Lexeme (ID)
  - Lemmas
  - Forms
  - Senses
A Lexeme is described using the following information:
- An ID. Lexemes have IDs starting with an "L" followed by a natural number in decimal notation, e.g. L3746552. These IDs are unique within the repository that manages the Lexeme. The ID can be combined with a repository's concept base URI to form a unique URI for the Lexeme.
- The Language to which the lexeme belongs. This is a reference to a concrete Item, e.g. English (Q1860).
- The Lexical category to which the lexeme belongs. This is given as a reference to a concrete Item, e.g. adjective (Q34698).
- Lemma (plural Lemmas) for use as a human-readable representation of the lexeme, e.g. "run" or "when pigs fly". A Lexeme usually has a single lemma, but several lemmas are common for languages written in more than one script, for example सूर्य/سُوریہ (L476308) or ama/𒂼 (L1).
- Note: the script of each Lemma is indicated by its language code, which should be a valid IETF language tag, although the current design is likely more restrictive than the full specification. Do NOT over-specify, especially when no confusion is possible.
- For uncoded languages, use mis-x-Q[...] to refer to the language's item ID. Similarly, use langcode-x-Q[...] for coded languages written in an uncoded script. The software currently rejects anything written after the first QID, so you cannot describe which script an uncoded language is written in.
- A list of Lexeme Statements (claims) to describe properties of the lexeme that are not specific to a Form or Sense (e.g. derived from or grammatical gender or syntactic function)
- Usage examples demonstrate a particular Sense and a particular Form of a word, so they should be added as statements on the Lexeme itself, with qualifiers such as:
- Statement: usage example (P5831)
- Add qualifier: subject form (P5830)
- Add qualifier: subject sense (P6072)
- Add qualifier: language style (P6191) with values colloquial language (Q901711) or formal register (Q104597585) (optional)
- A list of Forms, typically one for each relevant combination of grammatical features, such as 2nd person / singular / past tense. A Form is described using the following information:
- An ID. Forms have IDs starting with the ID of the Lexeme they belong to, followed by a hyphen ("-") and an "F", followed by a natural number in decimal notation, e.g. L3746552-F7.
- A representation, spelling out the Form as a string.
- A list of grammatical features that define the syntactic role to which the given Form applies. These are given as references to concrete Items, e.g. participle (Q814722).
- A list of Form Statements further describing the Form or its relations to other Forms or Items (e.g. IPA transcription (P898), pronunciation audio, rhymes with, used until, used in region)
- A list of Senses, describing the different meanings of the lexeme (e.g. "financial institution" and "edge of a body of water" for the English noun bank). A sense is described using the following information:
- An ID. Senses have IDs starting with the ID of the Lexeme they belong to, followed by a hyphen ("-") and an "S", followed by a natural number in decimal notation, e.g. L3746552-S4. These IDs are unique within the repository that manages the Lexeme. The ID can be combined with a repository's concept base URI to form a unique URI for the Sense.
- A Gloss, defining the meaning of the Sense using natural language.
- A list of Sense Statements further describing the Sense and its relations to Senses and Items (e.g. translation, synonym, antonym, connotation, register, denotes, evokes).
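The ID formats described in the list above can be checked mechanically. The following is a minimal Python sketch, not part of WikibaseLexeme: the function names are hypothetical, the pattern assumes that numbers have no leading zeros, and the default base URI is the concept base URI used by Wikidata itself (other repositories will differ).

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern: "L" + decimal number, optionally followed by
# "-F<number>" (a Form) or "-S<number>" (a Sense).
_ID_PATTERN = re.compile(r"L([1-9][0-9]*)(?:-([FS])([1-9][0-9]*))?")


class ParsedId(NamedTuple):
    kind: str                   # "lexeme", "form" or "sense"
    lexeme_id: str              # ID of the owning Lexeme, e.g. "L3746552"
    sub_number: Optional[int]   # 7 for "L3746552-F7"; None for a Lexeme ID


def parse_entity_id(entity_id: str) -> ParsedId:
    """Split a Lexeme, Form or Sense ID into its parts."""
    match = _ID_PATTERN.fullmatch(entity_id)
    if match is None:
        raise ValueError(f"not a Lexeme, Form or Sense ID: {entity_id!r}")
    number, letter, sub = match.groups()
    kind = {"F": "form", "S": "sense", None: "lexeme"}[letter]
    return ParsedId(kind, f"L{number}", int(sub) if sub else None)


def concept_uri(entity_id: str,
                base: str = "http://www.wikidata.org/entity/") -> str:
    """Combine a valid ID with a repository's concept base URI."""
    parse_entity_id(entity_id)  # raises ValueError on malformed IDs
    return base + entity_id
```

For example, `parse_entity_id("L3746552-F7")` reports a Form belonging to Lexeme L3746552, while an Item ID such as "Q1860" is rejected with a `ValueError`.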
This data model is further extended by the set of properties typically used for Lexeme statements, Form statements, and Sense statements. See Wikidata:Lexicographical data/Properties for an overview of these properties and Wikidata:Property proposal/Lexemes for current proposals of additional properties.
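As a reading aid, the structure described in this section can be pictured as nested records. The sketch below is illustrative only: the class and field names are assumptions chosen for readability and are not the WikibaseLexeme serialization, statements are left as plain dictionaries, and the Lexeme ID "L999999" is a placeholder rather than the real ID of the English noun "bank". Lemmas, representations and glosses are keyed by language code, which is an assumption drawn from the spelling-variant and language fields of the editing interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Form:
    id: str                                   # e.g. "L3746552-F7"
    representations: Dict[str, str]           # language code -> spelling
    grammatical_features: List[str] = field(default_factory=list)  # Item IDs
    statements: List[dict] = field(default_factory=list)


@dataclass
class Sense:
    id: str                                   # e.g. "L3746552-S4"
    glosses: Dict[str, str]                   # language code -> gloss
    statements: List[dict] = field(default_factory=list)


@dataclass
class Lexeme:
    id: str                                   # e.g. "L3746552"
    language: str                             # Item ID, e.g. "Q1860" (English)
    lexical_category: str                     # Item ID of the category
    lemmas: Dict[str, str]                    # language code -> lemma
    statements: List[dict] = field(default_factory=list)
    forms: List[Form] = field(default_factory=list)
    senses: List[Sense] = field(default_factory=list)


# The English noun "bank" with the two senses mentioned above.
bank = Lexeme(
    id="L999999",               # placeholder ID, for illustration only
    language="Q1860",           # English
    lexical_category="Q1084",   # noun
    lemmas={"en": "bank"},
    forms=[
        Form("L999999-F1", {"en": "bank"}, ["Q110786"]),    # singular
        Form("L999999-F2", {"en": "banks"}, ["Q146786"]),   # plural
    ],
    senses=[
        Sense("L999999-S1", {"en": "financial institution"}),
        Sense("L999999-S2", {"en": "edge of a body of water"}),
    ],
)
```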
Sample lexemes by language and lexical category
In some languages or cases, related words may be modeled as several Lexeme entities, in others as just one. The table below gives an overview of how they may be handled and linked:
difference in | 1 lexeme | | 2+ lexemes | | |
---|---|---|---|---|---|
sense | add several senses | | add applicable sense to lexeme | link other(s) with homograph lexeme | duplicate forms on each |
etym. | add etym. to each sense | | add etym. to lexeme base | link other(s) with homograph lexeme | duplicate forms on each |
gender | add gender to each sense | | add gender to lexeme base | link other(s) with homograph lexeme | duplicate forms on each |
common/proper | add several senses | use lexical category "noun" | add applicable sense to lexeme | link other(s) with homograph lexeme | duplicate forms on each |
caps/lowercase | add several forms | qualify forms to applicable senses | add applicable sense to lexeme | link other(s) with homograph lexeme | add only applicable forms |
singular/plural | add several forms | qualify forms to applicable senses | add applicable sense | if possible link other(s) with homograph lexeme | add only applicable forms |
pronunciation | add the same form twice | qualify forms to applicable senses, add pronunciation | add applicable sense | if possible link other(s) with homograph lexeme | add form and applicable pronunciation |
forms/spelling | add several forms or alternate forms | qualify forms to applicable senses | add applicable sense | if possible link other(s) with homograph lexeme | add only applicable forms |
For a given language and criterion (first column), only one of the two approaches might apply.
Interface
Lexeme
Create a new Lexeme
- Go to Special:NewLexeme
- Enter a lemma (dictionary form of a word) — Lemma
- Enter the language of the lexeme by typing the name of the language or Q-ID — Lexeme's language
- In the field that appears above, enter the language code of the lemma — Spelling variant of the Lemma
- Enter the lexical category by typing its name or the Q-ID (example: verb, noun, adjective...) — Lexical category
- Click on "Create"
- The Lexeme is now created with this basic information, and you can continue editing it
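The four fields filled in above map onto a small data structure. The sketch below is hedged: it only builds a dictionary and sends nothing to Wikidata, the function names are hypothetical, and the key names ("lemmas", "language", "lexicalCategory") are assumed to match the JSON layout used for Lexeme entities. The helper for the spelling variant follows the mis-x-Q[...] and langcode-x-Q[...] rule from the Data Model section; "Q12345" in the usage note is a made-up placeholder, not a real language or script Item.

```python
from typing import Optional


def spelling_variant(lang_code: Optional[str],
                     item_id: Optional[str] = None) -> str:
    """Build the language code entered in the "Spelling variant" field.

    - coded language in a coded script:    "en"
    - uncoded language:                    "mis-x-<QID>"
    - coded language, uncoded script:      "<langcode>-x-<QID>"
    """
    if item_id is None:
        if not lang_code:
            raise ValueError("either a language code or an Item ID is needed")
        return lang_code
    return f"{lang_code or 'mis'}-x-{item_id}"


def new_lexeme_data(lemma: str, variant: str,
                    language_item: str, lexical_category_item: str) -> dict:
    """Assemble the basic information of a new Lexeme (assumed key names)."""
    return {
        "type": "lexeme",
        "lemmas": {variant: {"language": variant, "value": lemma}},
        "language": language_item,                 # e.g. "Q1860" (English)
        "lexicalCategory": lexical_category_item,  # e.g. "Q34698" (adjective)
    }
```

For instance, `new_lexeme_data("green", spelling_variant("en"), "Q1860", "Q34698")` describes an English adjective, and `spelling_variant(None, "Q12345")` yields "mis-x-Q12345" for an uncoded language.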
Edit a Lexeme
- Click on the edit button, next to the lemma
- Edit the content of the different fields
- Lemma
- Language code of the lemma — Spelling variant
- Language of the Lexeme — Language
- Lexical category
- Click on "publish"
Add, edit or delete statements of a Lexeme
- To add a statement to a Lexeme, click on "add statement"
- Enter a property: start typing its name in the property field (example: derived from lexeme) and select it in the suggester
- Enter a value.
Note: A Wikidata property for lexicographic senses (Q54275340), such as translation (P5972) or synonym (P5973), does not currently support searching for Sense values by Lexeme name. To enter a value for such a statement, you need to enter the precise Sense ID of the Sense you want as a value. For example, mother (L3625) has the statement synonym (P5973) mom (L11530). Entering L11530-S2 is the only way this value can be published. Wikidata will not find Lexemes and their Senses when searching by name, but searching by a precise Sense ID returns a publishable result.
- Just like on Items, you can add qualifiers and references
- Save by clicking "publish"
- To edit a statement, click on "edit"
- To delete a statement, click on "edit", then "remove"
Delete a Lexeme
- Go to WD:RFD (Wikidata:Requests for deletions) and request deletion of the Lexeme
Search for a Lexeme
Here's how you can look for Lexemes, Lemmas, Forms or Senses, via Special:Search or the search box on any page:
- look for a lexeme by its L-number
- by typing "Lexeme:L123"
- by typing "L123" and selecting the Lexeme namespace
- look for a Lexeme by the name of its lemma
- by typing "Lexeme:sandbox"
- by typing "sandbox" and selecting the Lexeme namespace
- use the L shortcut: "L:L123" or "L:sandbox"
- look for a Form (e.g. "Lexeme:mangeant") with any of the methods described above
Note that the selector (the drop-down menu that suggests results) does not work yet. If you press Enter or click search after typing your keyword, you will still reach the results.
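The prefixed search strings listed above can also be turned into links. This is a small sketch under stated assumptions: the function name is hypothetical, and the URL shape (index.php with title=Special:Search and a search parameter) is the usual MediaWiki form on www.wikidata.org rather than anything specific to Lexemes.

```python
from urllib.parse import urlencode

# Assumed entry point of the wiki; other Wikibase installations will differ.
SEARCH_PAGE = "https://www.wikidata.org/w/index.php"


def lexeme_search_url(term: str, shortcut: bool = False) -> str:
    """Build a Special:Search link for an L-number, lemma or Form.

    With shortcut=True the short "L:" prefix is used instead of "Lexeme:".
    """
    prefix = "L" if shortcut else "Lexeme"
    query = urlencode({"title": "Special:Search",
                       "search": f"{prefix}:{term}"})
    return f"{SEARCH_PAGE}?{query}"
```

Calling `lexeme_search_url("sandbox")` searches for "Lexeme:sandbox", and `lexeme_search_url("L123", shortcut=True)` searches for "L:L123".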
Form
Create a new Form
- In the Forms section, click on "add Form"
- Fill the representation — Representation (mandatory)
- Fill the language code of the representation — Spelling variant (mandatory)
- Enter one or several grammatical features, by typing their name and selecting them in the list of items — Grammatical features
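The fields of the "add Form" dialog can be mirrored in a small helper that enforces the two mandatory fields. This is a sketch only: the function name is hypothetical, nothing is sent to Wikidata, and the key names ("representations", "grammaticalFeatures", "claims") are assumed to match the JSON layout used for Forms.

```python
from typing import Iterable


def new_form_data(representation: str, variant: str,
                  grammatical_features: Iterable[str] = ()) -> dict:
    """Assemble the data of a new Form (assumed key names)."""
    if not representation:
        raise ValueError("the representation is mandatory")
    if not variant:
        raise ValueError("the spelling variant (language code) is mandatory")
    return {
        "representations": {
            variant: {"language": variant, "value": representation},
        },
        # Item IDs of the grammatical features, e.g. "Q814722" (participle).
        "grammaticalFeatures": list(grammatical_features),
        "claims": [],  # Form statements can be added afterwards
    }
```

For example, `new_form_data("mangeant", "fr", ["Q814722"])` describes a French Form tagged as a participle, while an empty representation raises a `ValueError`.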
Edit a Form
- Click on the "edit" button next to the representation
- Modify the content in the fields
- Click on "publish"
Delete a Form
- Click on the "edit" button next to the representation
- Click on "remove"
Transliterations (Scripts/Phonetics)
- New subpage link to be added here (proposed on the mailing list by Thadguidry (talk) 04:29, 13 December 2020 (UTC))
Sense
Create a new Sense
- In the Senses section of a Lexeme, click on "add Sense"
- Enter a language code (for example: en, fr, zh) — Language
- Enter a gloss (a very short phrase defining the meaning, equivalent to skos:definition) — Gloss. Note: if a gloss is quoted or citable from a source, use gloss quote (P8394)
- You can add new glosses by clicking on "add"
- Click on "publish"
Translations, Synonyms, etc.
Each Sense can have many Sense statements that link it not only to other Senses but also to Items, through translations, synonyms, antonyms, connotations, register, evokes, usage examples, refers-to-concept, etc.
This is shown on the colored visualization of the Lexeme Data Model svg image above.
Edit a Sense
- Click on the "edit" button, next to the Sense ID
- Edit the content of the different fields
- Click on "publish"
Remove a Sense
- Click on the "edit" button, next to the Sense ID
- Click on "remove"
Features
See also: Wikidata:Lexicographical data/Development
What is included in the first version
- New datatypes: Lexeme, Form
- Add, edit, delete Lexemes
- Add, edit, delete Forms
- Add, edit, delete statements
- Add, edit, delete qualifiers
- Add, edit, delete references
- Linking to an Item from a Lexeme or a Form
- Linking to another Lexeme from a Lexeme, a Form or an Item
- Search and suggestions when entering a value
- Basic internal APIs (used for UI, you should not use them)
What will be added in the future
Ordered from near-term to long-term plans
- Search for content with Special:Search (done)
- Display the lemma in the history pages, recent changes and watchlist (done)
- Add, edit, delete Senses (done)
- RDF support and ability to query the data on query.wikidata.org (done)
- Better API support
- Automatic generation of Forms
- Data access on clients (other Wikimedia projects)
- Editing data directly from Wiktionary
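Since querying on query.wikidata.org is listed above as available, the following sketch shows what a simple query for Lexemes of one language might look like. It is an illustration under assumptions, not an official recipe: the function names are hypothetical, the code only builds a query string and a URL without sending a request, and the query relies on the ontolex:LexicalEntry class and the dct:language and wikibase:lemma predicates of the Lexeme RDF mapping.

```python
import re
from urllib.parse import urlencode

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

# Doubled braces are literal braces for str.format().
QUERY_TEMPLATE = """\
PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT ?lexeme ?lemma WHERE {{
  ?lexeme a ontolex:LexicalEntry ;
          dct:language wd:{language_item} ;
          wikibase:lemma ?lemma .
}}
LIMIT {limit}"""


def lexemes_in_language_query(language_item: str, limit: int = 10) -> str:
    """Return a SPARQL query listing Lexemes of the given language Item."""
    if not re.fullmatch(r"Q[1-9][0-9]*", language_item):
        raise ValueError(f"not an Item ID: {language_item!r}")
    return QUERY_TEMPLATE.format(language_item=language_item, limit=int(limit))


def query_url(query: str) -> str:
    """Build a GET URL for the query service (no request is made here)."""
    return f"{SPARQL_ENDPOINT}?{urlencode({'query': query, 'format': 'json'})}"
```

For example, `query_url(lexemes_in_language_query("Q1860", limit=5))` produces a link that asks for five English Lexemes with their lemmas.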