Wikidata:Lexicographical data/Documentation/Lemmata

The lemmata (singular lemma) of a lexeme are primarily used as human-readable representations of the lexeme. Usually lemmata are the written forms of a word, phrase, or affix that would be found in a dictionary describing them, whether or not they are considered the 'base' or 'stem' forms morphologically.

Each lemma consists of a string accompanied by a valid IETF language tag (or 'language code'). This code, which often coincides with the ISO 639-1 or ISO 639-3 code for a language (such as es or dag), may also contain subtags referring to a particular writing system (e.g. ja-hira), region (e.g. de-ch), or orthographic standard (e.g. pt-ao1990).
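
As a rough illustration of the structure described above, a lemma can be modelled as a string paired with a language code whose subtags are separated by hyphens. The `Lemma` class and its method names are hypothetical and are not part of any Wikidata API; the sketch only splits on hyphens and does not validate the tag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lemma:
    """A lemma: a written string paired with a language code (hypothetical model)."""
    text: str
    language_code: str

    def subtags(self):
        # Subtags of an IETF language tag are separated by hyphens.
        return self.language_code.split("-")

    def primary_language(self):
        # The first subtag identifies the language itself (e.g. 'de' in 'de-ch').
        return self.subtags()[0]


lemma = Lemma(text="tall", language_code="en")
print(lemma.primary_language())            # en
print(Lemma("のむ", "ja-hira").subtags())  # ['ja', 'hira']
```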

A lexeme's lemmata are what is displayed when the {{L}} template is used to link to that lexeme on Wikidata.

Examples

The English lexeme Lexeme:L1296 has the lemma 'tall' because most English dictionaries provide information about this lexeme under the heading 'tall' and not under something like 'taller' or 'tallest'.
The Bengali lexeme Lexeme:L308045 has the lemma 'খাওয়া' because most Bengali dictionaries provide information about this lexeme under the heading 'খাওয়া' and not under something like 'খাই', 'খা-', or 'খেতে'.
The Italian lexeme Lexeme:L1196895 has the lemma 'cantare' because most Italian dictionaries provide information about it under that heading and not under something like 'canto', 'cantante', or 'cantato'.
The Modern Greek lexeme Lexeme:L1098915 has the lemma 'πίνω' because most dictionaries of the modern Greek language provide information about it under that heading and not under something like 'πιω', 'πίνομαι', or 'πιει'.
The Korean lexeme Lexeme:L154 has the lemma '마시다' because most Korean dictionaries provide information about it under that form, rather than something like '마시-', '마셔', or even '마십니다'.
The Japanese lexeme Lexeme:L830 has the lemmata '為る' and 'する', even though the use of the 為 character is very unusual, because most Japanese dictionaries provide information about this lexeme under both headings.

Multiple lemmata

Lexemes can have several lemmata, particularly when there are differences in the writing system or other orthographic conventions within a given language. Different lemmata are indicated with different language tags, and a lexeme may only have one lemma for a given language tag (that is, there cannot be two lemmata on the same lexeme with language code da or two with language code gsg).
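
The one-lemma-per-language-code rule can be pictured as a mapping keyed by language code. The following is a minimal sketch under that assumption; `add_lemma` and `DuplicateLanguageCode` are hypothetical names for illustration and do not correspond to any Wikidata interface.

```python
class DuplicateLanguageCode(ValueError):
    """Raised when a lexeme already has a lemma for the given language code."""


def add_lemma(lemmata, language_code, text):
    # A lexeme may hold only one lemma per language code,
    # so the language code serves as the dictionary key.
    if language_code in lemmata:
        raise DuplicateLanguageCode(
            f"a lemma with code {language_code!r} already exists"
        )
    lemmata[language_code] = text


lemmata = {}
add_lemma(lemmata, "en", "hemophilia")
add_lemma(lemmata, "en-gb", "haemophilia")  # allowed: a different language code

try:
    add_lemma(lemmata, "en", "haemophilia")  # rejected: 'en' is already taken
except DuplicateLanguageCode as error:
    print(error)
```

This is why a second spelling has to be recorded under a distinct code (here en-gb) rather than as a second lemma under en.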

Examples of writing system differences

The Hausa lexeme Lexeme:L314793 has two lemmata, 'aboki' with code ha and 'أَبُوكِی' with code ha-arab, which are representations of the same dictionary form in the Latin script (used more generally) and the Arabic script.
The Hindustani lexeme Lexeme:L641622 has two lemmata, 'चाचा' with code hi and 'چاچا' with code ur, which are representations of the same dictionary form (pronounced /t͡ʃɑː.t͡ʃɑː/) in the Devanagari script (used for Hindi) and the Arabic script (used for Urdu).
The Japanese lexeme Lexeme:L572 has two lemmata, 'のむ' with code ja-hira and '飲む' with code ja, which are representations of the same dictionary form in either exclusively hiragana or the mixed script of Chinese characters, hiragana, and katakana.
The New Persian lexeme Lexeme:L742511 has two lemmata, 'دیدن' with code fa and 'дидан' with code tg, which are representations of the same dictionary form in the Arabic script (used for Persian in Iran and Afghanistan) and in the Cyrillic script (used for Tajik).
The Punjabi lexeme Lexeme:L679506 has two lemmata, 'ਪਿੰਡ' with code pa and 'پِنڈ' with code pnb, which are representations of the same dictionary form in the Gurmukhi script (used in India) and the Arabic script (used in Pakistan).
The Southern Min lexeme Lexeme:L308008 has three lemmata, '城市' with code nan-hani, 'siânn-tshī' with code nan-x-Q56929, and 'siâⁿ-chhī' with code nan-x-Q559173. These represent the same word form written in either Chinese characters or one of two romanization systems.
The Turkish lexeme Lexeme:L1171764 has two lemmata, 'yaşamak' with code tr and 'یاشامق' with code ota, which are representations of the same word form after and before, respectively, the introduction of the Latin script to Turkish in 1928.

Examples of orthographic variation differences

The English lexeme Lexeme:L35013 has two lemmata, 'hemophilia' with code en and 'haemophilia' with code en-gb, reflecting a difference in spelling this word between different parts of the English-speaking world.
The Hebrew lexeme Lexeme:L63672 has two lemmata, 'אדום' with code he and 'אָדֹם' with code he-x-Q21283070, which reflect differences in how the same word form is spelt depending on whether diacritics are present.
The Portuguese lexeme Lexeme:L500697 has two lemmata, 'ciência' with code pt and 'sciência' with code pt-colb1945, reflecting differences in orthographic standard between (in the former case) Portugal, Brazil, and Cape Verde and (in the latter case) the rest of the Lusosphere.
The Esperanto lexeme Lexeme:L616380 has three lemmata, 'akuŝistino' with code eo, 'akusxistino' with code eo-xsistemo, and 'akushistino' with code eo-hsistemo, reflecting differences in how the circumflex diacritic is substituted when typesetting Esperanto using only ASCII characters.
The Belarusian lexeme Lexeme:L8880 has two lemmata, 'есці' with code be and 'е́сьці' with code be-tarask, reflecting differences in orthographic standard after and before, respectively, the reforms introduced in the territory of Belarus in 1933.

Handling lemmata language code uniqueness

Because it is not possible to add multiple lemmata to a lexeme that share the same language code, different strategies have been employed across different languages to deal with this problem. What works for the languages below may not be optimal for your own language: be sure to weigh these and other strategies before choosing one for your own language!

For Bokmål and Nynorsk, where the variation in spelling of a word is tied more to personal preference than to any particular standard, entirely different lexemes are created to deal with this variation, such as in Lexeme:L1219886 for 'kvalkjøt' and Lexeme:L1219887 for 'kvalkjøtt', both in Nynorsk.
For Southern Min, one pronunciation variation is treated as the lemma and the others as forms, such as in Lexeme:L306309 where the pronunciation 'muê' from one dialect is treated as a lemma and pronunciations from three other dialects are added as lexeme forms.
For Bengali, where a spelling differs based on its prescription by the language authority in Bangladesh or the language authority in West Bengal, this difference is indicated via a custom language code (see below) on the lemma using the QIDs for those language authorities' items, as in Lexeme:L308189.

Custom language codes

Some language codes used in lemmata contain the private-use marker '-x-'. There are two main reasons for this: 1) the desired language code, while a valid IETF language tag, is unsupported or unsupportable in Wikidata, or 2) a variant of an existing supported language tag is unsupported or unsupportable in Wikidata.
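
The two cases can be told apart by looking at what precedes '-x-' in the code. The sketch below assumes that codes for entirely unsupported languages use the base code mis and that any other base indicates an unsupported variant; the function name and the returned labels are hypothetical.

```python
def classify_language_code(code):
    # Split at the first '-x-'; 'marker' is empty when the code has none.
    base, marker, private_part = code.partition("-x-")
    if not marker:
        return "no private-use subtag"
    if base == "mis":
        # Case 1: the language itself has no supported code.
        return "unsupported language"
    # Case 2: the base code is supported, but this variant of it is not.
    return "unsupported variant"


print(classify_language_code("ja-hira"))         # no private-use subtag
print(classify_language_code("mis-x-Q2665246"))  # unsupported language
print(classify_language_code("bn-x-Q48726757"))  # unsupported variant
```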

Entirely unsupported language codes

For languages whose language codes are not yet supported or are not supportable, a last-resort option is to use the base code mis followed by a private-use subtag containing the QID of the Wikidata item for the language (that is, mis-x- followed by the QID).

Lexemes in Torwali (Q2665246), such as Lexeme:L1003531, have a lemma with the code mis-x-Q2665246 (though the desired supportable code would be trw; tracked in Phabricator task T314458).
Lexemes in Soyot (Q4426878), such as Lexeme:L1015954, have a lemma with the code mis-x-Q4426878 (though no supportable language code exists).
Lexemes in Láadan (Q35757), such as Lexeme:L623039, have a lemma with the code mis-x-Q35757 (though the desired supportable code would be ldn; tracked in Phabricator task T302705).
Lexemes in Yaghnobi (Q34247), such as Lexeme:L684534, have a lemma with the code mis-x-Q34247 (though the desired supportable code would be yai).
Lexemes in Proto-Indo-European (Q37178), such as Lexeme:L638724, have a lemma with the code mis-x-Q37178 (though no supportable language code exists).
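
The fallback codes in the list above all follow the same pattern, which can be sketched as a small helper. The function name `fallback_language_code` is hypothetical, and the QID check (a 'Q' followed by a positive integer) is an assumption made for illustration.

```python
import re

# Assumed shape of a Wikidata item ID: 'Q' followed by a positive integer.
QID_PATTERN = re.compile(r"Q[1-9][0-9]*")


def fallback_language_code(language_qid):
    """Build the last-resort code for a language with no supported code."""
    if not QID_PATTERN.fullmatch(language_qid):
        raise ValueError(f"not a QID: {language_qid!r}")
    # 'mis' is the base code; the QID goes in the private-use subtag.
    return f"mis-x-{language_qid}"


print(fallback_language_code("Q2665246"))  # mis-x-Q2665246 (Torwali)
print(fallback_language_code("Q37178"))    # mis-x-Q37178 (Proto-Indo-European)
```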

Unsupported variants of supported language codes

If a language has a supported language code but a variant of it (such as a dialect or writing system) has a code that is not supported or supportable, the private-use subtag may be attached directly to the existing supported code.

Lexemes in the Varendri (Q48726757) dialect of Bengali, such as Lexeme:L672268, have a lemma with the code bn-x-Q48726757 (where 'bn' is the existing supported code, but no supportable code substitute exists).
Lemmata in Devanagari Sindhi (Q116688933) for lexemes in Sindhi use the language code sd-x-Q116688933 (where 'sd' is the existing supported code, but the supportable code sd-deva exists; tracked in Phabricator task T328603).
Lemmata in the Adlam (Q19606346) script for lexemes in Fula use the language code ff-x-Q19606346 (where 'ff' is the existing supported code, but the supportable code ff-adlm exists).
Lemmata in the Brolikva (Q113301414) system for lexemes in Brahui use the language code brh-x-Q113301414 (where 'brh' is the existing supported code, but the supportable code brh-latn exists; tracked in Phabricator task T315999).
Lemmata in the Mongolian (Q1055705) script for lexemes in Mongolian use the language code mn-x-Q1055705 (where 'mn' is the existing supported code, but the supportable code mn-mong was rejected by the Language Committee).
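
The variant codes listed above combine an existing supported code with the QID of the variant's Wikidata item. A minimal sketch of that construction follows; `variant_language_code` is a hypothetical helper, and the function does not check whether the base code is actually supported in Wikidata.

```python
import re

# Assumed shape of a Wikidata item ID: 'Q' followed by a positive integer.
QID_PATTERN = re.compile(r"Q[1-9][0-9]*")


def variant_language_code(supported_code, variant_qid):
    """Attach a private-use subtag for a variant to an existing language code."""
    if "-x-" in supported_code:
        raise ValueError("the base code already has a private-use subtag")
    if not QID_PATTERN.fullmatch(variant_qid):
        raise ValueError(f"not a QID: {variant_qid!r}")
    return f"{supported_code}-x-{variant_qid}"


print(variant_language_code("bn", "Q48726757"))  # bn-x-Q48726757
print(variant_language_code("mn", "Q1055705"))   # mn-x-Q1055705
```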