Wikidata:Lexicographical data/Documentation/Forms

Forms describe the different realizations of a lexeme in speech or writing.

There are four components of a lexeme form, described in each of the following sections:

its form ID;
its representations;
its grammatical features; and
its statements.

Number of forms on a lexeme

Depending on a language's morphology and on the size of the lexical category to which a lexeme belongs, a lexeme may have exactly one form or several.

In general, the more isolating or analytic a language is, or the more agglutinative or polysynthetic it is, the more it may benefit from having one form per lexeme. Lexemes in fusional languages, by contrast, typically have multiple forms, one for each relevant combination of grammatical features.

Some examples of the considerations made for particular languages are listed below. What works for one language may not work for another, no matter how closely related they may be: this judgment has to be made case by case, based on how your language works!

Cantonese (Q9186) is a highly analytic (and, with respect to inflection, nearly isolating) language. Words are typically not modified directly to indicate grammatical changes; instead, they are flanked on one side or the other by particles and other clearly separable modifiers. (電視 (L1207246), for example, would always appear this way within a Cantonese sentence describing a television.) As a result, Cantonese lexemes typically have just one form.
Swedish (Q9027) is an analytic language with some fusionality. Nouns are generally realized in one of 8 ways based on number, definiteness, and case (nominative or genitive), and the indicators of these features are not readily separable from the rest of the word; see, for example, klimat (L46694). Verbs can also be realized in one of 11 ways depending on tense or whether auxiliary verbs are present, and those indicators too are not always cleanly separable; see, for example, anpassa (L46942). As a result, most Swedish lexemes have multiple forms, though only a handful each.
Bangla (Q9610) is a more fusional language with some aspects of agglutination. Nouns are typically realized in only 4 or 5 ways depending on case (indications of number, if necessary, are clearly separable); হৃদয় (L301993) is a great example. Verbs can be realized in almost 100 different ways through combinations of person, tense, aspect, and formality level; দেখানো (L642917) comes close to this number. While some portions of verbs can be separated somewhat, particularly those pertaining to aspect and person, verbs themselves are a smaller and more closed lexical category: it is not as easy to turn a word in Bangla into a verb as it is to turn that word into a noun, so verbs do not proliferate in Bangla as much as they may in some other languages. As such, while nouns typically have either 4 or 5 forms, it is acceptable for verbs to have around 100 different forms.
Turkish (Q256) is a largely agglutinative language. Inflectional and derivational morphemes are almost always clearly separable, for nouns, verbs, and adjectives alike, and differences in how those morphemes are realized, usually for phonological reasons, are indicated on the morphemes themselves. As a result, Turkish lexemes (except for derivational and inflectional affixes) can get away with having only one or two forms.
No polysynthetic language has a large enough body of well-developed Wikidata lexemes yet to serve as an example, but such languages are likely to also be able to get away with having only a few forms per lexeme, for roughly the same reasons as Turkish.

Form ID

The form ID starts with the ID of the lexeme it belongs to, followed by a hyphen ("-") and an "F", followed by a natural number in decimal notation: e.g. L3746552-F4. These IDs are unique within Wikidata; when a new form is created within a lexeme, an entirely new form ID is provided for it. Like an LID or a sense ID, a form ID may be appended to http://www.wikidata.org/entity/ to form a unique URI for the form.
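The ID format described above can be checked and turned into a URI with a few lines of code. The following is a minimal Python sketch, not an official tool: the names parse_form_id and form_uri are hypothetical, and the pattern assumes that a lexeme ID is "L" followed by a positive decimal number without leading zeros.

```python
import re
from typing import NamedTuple

# Assumed pattern: lexeme ID ("L" + number), then "-F", then a decimal number.
FORM_ID_RE = re.compile(r"^(L[1-9][0-9]*)-F([1-9][0-9]*)$")
ENTITY_BASE = "http://www.wikidata.org/entity/"


class FormId(NamedTuple):
    lexeme_id: str    # ID of the lexeme the form belongs to, e.g. "L3746552"
    form_number: int  # number following the "F", e.g. 4


def parse_form_id(form_id: str) -> FormId:
    """Split a form ID into its lexeme ID and its form number."""
    match = FORM_ID_RE.match(form_id)
    if match is None:
        raise ValueError(f"not a form ID: {form_id!r}")
    return FormId(match.group(1), int(match.group(2)))


def form_uri(form_id: str) -> str:
    """Append a validated form ID to the entity base URI."""
    parse_form_id(form_id)  # raises ValueError if the ID is malformed
    return ENTITY_BASE + form_id


print(parse_form_id("L3746552-F4"))
print(form_uri("L3746552-F4"))
```

Validating before building the URI keeps malformed strings (an item ID such as Q42, say) from being silently turned into links.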

Form representations

Form representations are strings, accompanied by language tags, that show how a particular form is written.

As with lemmata, there may be multiple representations on a single form to handle differences in writing system or orthographic variation within a language.

Many of the decisions made with regard to lexeme lemmata are also applicable to individual forms, particularly when it comes to having multiple form representations on a form or using custom language codes on form representations.

Grammatical features

Grammatical features are references to Wikidata items that define the syntactic circumstances in which a given form applies. There may be any number of grammatical features attached to a form, but which ones should be added depends on how your language works and the environments in which that form may appear.

The German noun form Geschichten (L296234-F5) is marked as being used when that noun is in the nominative case (Q131105) and is plural (Q146786).
The Hindustani verb form कीजिए/कीजिये/کیجیے (L579999-F42) is marked as being used in imperative (Q22716) situations (with a suggestive (Q113115684) air) when its subject is second person (Q51929049) plural (Q146786) and when the verb is used in particular phases.
The Bokmål adjective form største (L303993-F7) is marked as being used when treated as a superlative (Q1817208) and applied to a definite (Q53997851) noun.
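To make the four components of a form concrete, here is a rough Python sketch of the German example above as a plain dictionary. The key names are modeled on the JSON that Wikibase serves for lexeme forms, but treat them, the "de" language tag, and the helper has_features as assumptions made for illustration.

```python
# Illustrative stand-in for one form; not fetched from Wikidata.
form = {
    "id": "L296234-F5",
    "representations": {
        "de": {"language": "de", "value": "Geschichten"},  # assumed language tag
    },
    "grammaticalFeatures": [
        "Q131105",  # nominative case
        "Q146786",  # plural
    ],
    "claims": {},  # statements about the form would go here
}


def has_features(form: dict, *item_ids: str) -> bool:
    """Return True if the form carries every one of the given feature items."""
    return set(item_ids) <= set(form["grammaticalFeatures"])


print(has_features(form, "Q131105", "Q146786"))
print(has_features(form, "Q1817208"))  # superlative: not on this form
```

Because grammatical features are plain item references, checking whether a form applies in some environment reduces to a subset test on item IDs.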

Form statements

Like lexemes, senses, items, and properties, forms can have statements further describing the form and its relations to other forms and to Wikidata items.

Statements about form pronunciation

pronunciation audio

pronunciation audio (P443) can be used to link a form to a pronunciation of it on Commons. A form can have multiple pronunciation files, but Wikidata does not try to link all available files for a word on Commons.

If there isn't a pronunciation file available yet, you may want to record one yourself. There are many ways to do that, including by using Lingua Libre.

IPA transcription

The International Phonetic Alphabet (Q21204) is the main phonetic alphabet used for lexemes.

The user script User:Nikki/LexemeAddIPA.js can be used to make it easier to add IPA transcriptions to multiple forms.

Other phonetic alphabets

Sources sometimes provide phonetic transcriptions using other phonetic alphabets. The following properties can be used to add that information to Wikidata, especially when the IPA equivalent is not available.

UPA transcription: The Uralic Phonetic Alphabet (Q1287368) is a phonetic alphabet often used for Uralic languages, and sometimes also for other languages such as Turkic or Mongolic languages.
Slavistic Phonetic Alphabet transcription: The Slavistic Phonetic Alphabet (Q9338538) is a phonetic alphabet sometimes used for Slavic languages.

pronunciation

Statements about form orthography

hyphenation

The user script User:Lucas Werkmeister/hyphenation-point.js can be used to replace pipe characters (|) with the hyphenation point character (‧, U+2027 HYPHENATION POINT).
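The substitution itself is simple to reproduce outside the browser. The sketch below is not the user script's code, only a Python analogue of the replacement it is described as performing; the function name and the sample input are made up for illustration.

```python
HYPHENATION_POINT = "\u2027"  # U+2027 HYPHENATION POINT


def pipes_to_hyphenation_points(text: str) -> str:
    """Replace every pipe character with the hyphenation point character."""
    return text.replace("|", HYPHENATION_POINT)


# Illustrative input typed with pipes as stand-ins for hyphenation points.
print(pipes_to_hyphenation_points("Ge|schich|ten"))  # prints Ge‧schich‧ten
```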
