Wikidata:Glossary for Wiktionary

This is an attempt to describe the necessary terms for a Wiktionary extension of Wikidata. The terms are roughly modelled after what is usual for Norwegian, with some adaptations for other languages and for descriptions commonly used in Natural Language Processing and Generation.[1][2][3] It is not a complete or sufficient description, nor is it in any way official or final. It is just a draft that lists the terms and notes whether they are useful or create confusion.

Basic terms

All languages are built up of parts or building blocks. Some of those can be easily described by their sounds, and often those map to letters, rendered as glyphs in their written form, possibly with accents, to create words or parts of words. In some languages it is not the individual sounds that are described but whole words. Those languages often use ideograms, but can have additional ways to describe the individual sounds. This glossary does not try to make any rules on how such languages should be described.

(Should have some note about use of emic units in general, and perhaps also tagmeme, taxeme and glosseme. Perhaps stay to KISS.)

A language (Norwegian) can be described from the smallest parts to the largest as follows. (Some languages change meaning due to tone; this list doesn't include that.)

  • Phone (sounds, physical properties), Phoneme (some sources say this is the smallest abstract class, others say it covers grammatical properties of groups of phones), and Grapheme (the smallest unit used in describing the writing system of a language)
  • Morphemes (the smallest grammatical unit in a language)
  • Word classes (usually part of speech, but note that semantic, syntax and function can lead to different classes)
  • Word types (approximation from Norwegian)
  • Constituent (subordinate phrase, subordinate clause)
  • Phrase (sentence)

Because people approach natural language from many fields, such as neuropsychology, linguistics and the philosophy of language, various descriptions have been devised. Not only have people come from different fields, they have also had different needs when it comes to what kind of knowledge they must describe. The various kinds of knowledge they want to describe influence both our understanding of language and our descriptions of those languages, and include such things as[4]

  • Phonetics and Phonology — knowledge about linguistic sounds
  • Morphology — knowledge of the meaningful components of words
  • Syntax — knowledge of the structural relationship between words
  • Semantics — knowledge of meaning
  • Pragmatics — knowledge of the relationship of meaning to the goals and intentions of the speaker
  • Discourse — knowledge about linguistic units larger than a single utterance

The various backgrounds and needs have triggered different types of descriptions, so that we now have lexical descriptions, syntactic descriptions, semantic descriptions, functional descriptions, etc. Our description will be a simplified one that is geared towards use in a Wiktionary extension within Wikidata.

Sounds and letters

Letters are the names of the sounds, while the glyphs can be seen as the rendering of those names. In our glossary we use three main groups of sounds and letters (Note this is following Norwegian and we can have additional specialized sounds with their own letters in other languages, for example labiodental flap.)

  • Vowels — a speech sound that is articulated with an open or partially open vocal tract
  • Consonants — a speech sound that is articulated with complete or partial closure of the vocal tract
  • Diphthongs — two adjacent vowel sounds occurring within the same syllable

Sounds can be grouped in syllables, and in some languages the consonants can be part of consonant gradation (for example Sami languages). Consonant gradation can then change syllables, and writing of the word according to word parts (features), given writing rules for the language.

Not all sounds we make while we are speaking have a simple rendering, especially those that are speech disfluencies or repairs. They are usually not part of a dictionary and we will not consider them further. Those are sounds like[5]

  • Discourse particles or Fillers ("But, uh, that was absurd")
  • Word fragments ("A guy went to a d-, a landfill")
  • Repetitions ("it was just a change of, change of location")
  • Restarts ("it's — I find it very strange")

Word parts

Word roots and stems can be affixed with other morphemes to form new words. Usually the affixed morphemes carry meaning, but in some cases they carry semantic meaning, or only a very weak meaning. One example from English is "beloved", where "be" is a prefix with no special meaning.

  • Word roots — the part which can be extracted and is a real word in itself ("running" — note that "run" is a valid word)
  • Word stems (multiple alternative interpretations) — the part of the word that does not include the affix ("disrupt", "corrupt", "rupture" — note that "rupt" is not a valid word)
  • Affixes
    • Prefixes ("unreadable", "misunderstood")
    • Suffixes ("highest", "smallest")
    • Infixes ("pipecoline" — English has very few infixes, mostly in technical literature, but more in tmesis)
    • Interfixes ("Arbeitszimmer" — from German, goes between two morphemes and has no semantic meaning)
    • Circumfixes ("gespielt" — from German, in this case goes around "spielen")
    • Transfix, Simulfix, Suprafix, Disfix
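
The prefix and suffix cases in the list above can be illustrated with a minimal Python sketch. The affix inventories and the function name split_affixes are assumptions made for this example only, and the matching is purely string based; a real analysis would need a lexicon to decide whether a letter sequence is actually an affix.

```python
# Hypothetical, deliberately tiny affix inventories for illustration only.
PREFIXES = ("un", "mis")
SUFFIXES = ("able", "est")


def split_affixes(word):
    """Split a word into (prefix, stem, suffix) by naive string matching.

    At most one prefix and one suffix are removed, and only if a
    non-empty stem remains afterwards.
    """
    prefix = ""
    suffix = ""
    stem = word
    for candidate in PREFIXES:
        if stem.startswith(candidate) and len(stem) > len(candidate):
            prefix, stem = candidate, stem[len(candidate):]
            break
    for candidate in SUFFIXES:
        if stem.endswith(candidate) and len(stem) > len(candidate):
            stem, suffix = stem[:-len(candidate)], candidate
            break
    return prefix, stem, suffix


for example in ("unreadable", "misunderstood", "highest", "best"):
    print(example, split_affixes(example))
```

The last example shows the limitation: "best" is split as if "est" were a suffix, which is why string matching alone is not sufficient for identifying word parts.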

Word classes

Word classes often go under the name of part of speech, but what they should be called in this context is fairly open. It is not obvious what we try to model, but when the phrase "word classes" is listed in other contexts it is a fairly well defined list. Lexical categories or other names are often used interchangeably, but there can be small differences.

Common lexical categories defined by function may include the following (not all of them will necessarily be applicable in a given language):

Within a given category, subgroups of words may be identified based on more precise grammatical properties. For example, verbs may be specified according to the number and type of objects or other complements which they take. This is called subcategorization.

Word types

There are three word types (This is modelled after Norwegian, it can be necessary to extend this to other languages. Norwegian language has both agglutinative and inflectional (fusional) parts, while for example English language is more of an isolating language.)

  • plain word — a word that consists of a single stem
  • compound word — a word that consists of more than one stem
  • derived word — a new word formed on the basis of an existing word, often by addition of an affix, where one or more morphemes are overlaid to denote grammatical, syntactic, or semantic change

Phrase

A phrase is a group of words, or possibly even a single word, that functions as a constituent in the syntax of a sentence—a single unit within a grammatical hierarchy. A phrase appears within a clause, although it is also possible for a phrase to be a clause or to contain a clause within it.

Constituent

A constituent is a word or a group of words that functions as a single unit within a hierarchical structure. The analysis of constituent structure is associated mainly with phrase structure grammars, although dependency grammars also allow sentence structure to be broken down into constituent parts. A constituent can include other constituents or phrases.

Lexeme

A "Lexeme" holds a lemma (headword, catchword, canonical form, dictionary form, or citation form) that is customarily written in capital letters, a reference to a language taken from the Items, and a reference to one or more parts of speech taken from the Items. A lexeme is the set of all the forms that have the same meaning, and lemma refers to the particular form that is chosen by convention to represent the lexeme. Each Lexeme has its own identifier so it can be referenced in other statements. Embedded in each Lexeme is a block of Forms and a block of Senses.

The name word type is a troublesome label for the lexeme. Usually this is named word class, morphological class, inflectional category, lexical category, lexical tag, POS, or part of speech.[6] The term lexical category refers in some contexts to a particular type of syntactic category, and may thus exclude parts of speech that are considered to be functional, such as pronouns. Entries in a lexical category are called lexemes, which can also create confusion. The way Wiktionary uses lemmas is also confusing in itself, as they are not always words but can be parts of words like -ing. (It seems like Wiktionary's use is closer to a catch-all part of speech.)

For example, in the English language, run, runs, ran and running are forms of the same lexeme, conventionally written as RUN. Note that we reduce to the root word, which is run, and use that as the source for the lexeme RUN. But then note -ing, where there is no clear POS except affix and particle (the entry is about the particle, not its use). It should have a grammatical tag/category of gerund.

There will be separate entries for each of the languages where the lexeme RUN is used, and for English there will be separate entries for each of the word classes Verb, Noun and Adjective.

A lexeme is often confused with an utterance, but there can be several different utterances for the same lexeme. That is, there can be homonyms.
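
As a rough illustration of the description above, a Lexeme could be sketched in Python as a small record. The class name, field names and the identifiers "L1" and "L2" are assumptions made for this sketch, not part of any agreed data model; plain strings stand in for references to Items.

```python
from dataclasses import dataclass, field


@dataclass
class Lexeme:
    """Sketch of a Lexeme: lemma, language, parts of speech, Forms, Senses."""
    identifier: str          # hypothetical identifier format
    lemma: str               # customarily written in capital letters
    language: str            # stands in for a reference to a language Item
    parts_of_speech: list    # stands in for references to Items
    forms: list = field(default_factory=list)   # block of Forms
    senses: list = field(default_factory=list)  # block of Senses


# Separate entries per language and word class, all sharing the lemma RUN.
run_verb = Lexeme("L1", "RUN", "English", ["verb"],
                  forms=["run", "runs", "ran", "running"])
run_noun = Lexeme("L2", "RUN", "English", ["noun"],
                  forms=["run", "runs"])


def lexemes_for(lemma, lexemes):
    """Return every Lexeme that uses the given lemma."""
    return [lexeme for lexeme in lexemes if lexeme.lemma == lemma]


matches = lexemes_for("RUN", [run_verb, run_noun])
print([lexeme.identifier for lexeme in matches])
```

Because the lemma alone does not identify an entry, the sketch looks entries up by lemma and returns a list, while each entry keeps its own identifier for use in statements.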

Previous dev description

Wikidata:Wiktionary/Development/Proposals/2015-05#Task 3: Lexeme entity type
Has a single Label (not per language as for Items), Language, Word type, and Statements, but no Description or Sitelinks.
Wikidata:Wiktionary/Development/Proposals/2013-08#Terminology
The lemma is the canonical form or dictionary form of the lexeme, e.g. for verbs this is usually the infinitive form, for a noun the nominative singular, etc.
The lexical category, also known as the part of speech or word class, defines the lexeme to be either a noun, or a verb, or an adjective, etc. The set of possible values is open and taken from the Wikidata items.
The language of a lexeme is taken from Wikidata items, and thus an open set.

Form

A "Form" is a single entity that describes a specific morphology of a word root or stem. (This creates problems when describing affixes.) Each one of them has its own identifier so it can be referenced in statements. All Forms in a Lexeme are collected under a common section header "Forms".

All Forms should belong to the same word root or stem. (This can create problems with interfixes.) It is not necessary to describe every morphologically valid Form, but a sufficient number of Forms should be provided for the Senses to be valid and complete. A Form can be provided without a Sense, but a Sense should not be provided without a defining Form.

Both label and representation are troublesome names for the string. Label is closer to the Wikidata lingo, but representation is slightly closer to the linguistic world. The usual name for this is the wordform,[7] which seems to be both a better description and more in line with what is commonly used. For some word classes this is also called principal parts. It is not clear what to call the label when a Lexeme covers affixes.

Both grammatical marker and lexical property are troublesome labels for the grammemes. Usually this is named either grammatical category or grammatical feature. (Need reference) Note the possibility for confusion with lexical category, which is used for part of speech.[8] Entries in a grammatical category are called grammemes.

For example, in the English language, a lexeme RUN will have the wordforms

  • runs — third-person, singular, simple present (these last three are the grammemes)
  • running — present participle
  • ran — simple past
  • run — past participle
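
The wordforms listed above can be written down as data, which makes the relation between a wordform and its grammemes explicit. This is only a sketch: the class name Form, the identifiers "F1" to "F4" and the helper find_forms are assumptions for illustration, and plain strings stand in for grammemes that would be references to Items.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Form:
    """Sketch of a Form: one wordform plus its set of grammemes."""
    identifier: str       # hypothetical identifier format
    wordform: str
    grammemes: frozenset  # stands in for references to Items


# The wordforms of the English lexeme RUN, as listed above.
RUN_FORMS = [
    Form("F1", "runs", frozenset({"third-person", "singular", "simple present"})),
    Form("F2", "running", frozenset({"present participle"})),
    Form("F3", "ran", frozenset({"simple past"})),
    Form("F4", "run", frozenset({"past participle"})),
]


def find_forms(forms, *grammemes):
    """Return the wordforms whose grammemes include all requested ones."""
    wanted = set(grammemes)
    return [form.wordform for form in forms if wanted <= form.grammemes]


print(find_forms(RUN_FORMS, "simple past"))
print(find_forms(RUN_FORMS, "singular", "third-person"))
```

Keeping the grammemes as a set rather than a single string lets a lookup ask for any combination of them, whatever order they were entered in.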

Previous dev description

Wikidata:Wiktionary/Development/Proposals/2015-05#Task 5: Form entity type
Has a (single, not per language) Label (monolingual text), Grammatical markers, and Statements, but no Description or Sitelinks.
Wikidata:Wiktionary/Development/Proposals/2013-08#Terminology
A form is a specific, fully conjugated or inflexed form of the lexeme. A form consists of a representation, a set of lexical properties, and a set of statements. A form always belongs to one (and exactly one) lexeme.
A representation is the actual string value realizing a given form, e.g. the string value "wrote" for the past tense of the lexeme for "write". All representations are indexed for search.
A lexical property describes the form, e.g. tense or number for verbs, case for nouns, etc. This is an open set and points to Wikidata items.

Sense

A "Sense" is a single entity that holds a specific gloss, a brief marginal notation of a wordform. Each one of them has its own identifier so it can be referenced in statements. All Senses in a Lexeme are collected under a common section header "Senses".

A Sense holds an explanation of the meaning of a word from one referenced Form. As such it must satisfy all language-specific constraints, typically given as grammatical features and a specific wordform, of the referenced Form. It must use the specific wordform as part of its gloss/annotation. By doing so a back-reference can be made from the Sense to a Form. All Senses that use a specific wordform can get automatic back-references to that Form (will not work globally), and all Forms with a matching wordform will get forward-references to those Senses (will not work globally). (The forward and backward references are possible for interlinear gloss, but then they must have their own identifiers.)
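
The back-reference idea can be sketched as a simple lookup: if a gloss contains a wordform as a whole word, the Sense can be linked to the Form with that wordform. The function names and the example glosses below are made up for illustration, and whole-word matching within a single Lexeme is an assumption of this sketch, in line with the note that the references will not work globally.

```python
import re


def forms_referenced_by(gloss, wordforms):
    """Return the wordforms that occur as whole words in the gloss."""
    tokens = set(re.findall(r"\w+", gloss.lower()))
    return [wordform for wordform in wordforms if wordform.lower() in tokens]


def senses_for_form(wordform, glosses):
    """Return the glosses that mention the given wordform (forward reference)."""
    return [gloss for gloss in glosses if forms_referenced_by(gloss, [wordform])]


# Made-up glosses for the lexeme RUN, for illustration only.
wordforms = ["run", "runs", "ran", "running"]
glosses = [
    "ran: moved quickly on foot",
    "running: the act of moving quickly on foot",
]

print(forms_referenced_by(glosses[0], wordforms))
print(senses_for_form("running", glosses))
```

Matching on whole tokens keeps "run" from being found inside "running", so each gloss points back to exactly the wordform it uses.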

The gloss of a sense should be an interlinear gloss, which should be accurate but might not be easily translatable to other languages. For example, snow in some Inuit languages is described with a lot more detail than is possible in Germanic languages. When we try to do a metaphrase (word-by-word) translation we lose information, while if we try to do a paraphrase translation we lose precision. We have one original and accurate gloss, and translations that should be marked as either metaphrases or paraphrases. Further examples of alternative ways to encode an interlinear gloss are found in the section Structure. (Note that there are a lot of alternatives!)

Previous dev description

Wikidata:Wiktionary/Development/Proposals/2015-05#Task 6: Sense entity type
Has a Gloss (multilingual text, like a Label or Description for Items) and Statements, but no Label or Sitelinks.
Wikidata:Wiktionary/Development/Proposals/2013-08#Terminology
A sense is described by a gloss and has a set of statements. A sense always belongs to one (and exactly one) lexeme (and lexemes belong to one language only). Senses are not independent of lexemes.
A gloss is a short description (translatable in all languages of the Wikidata UI) of one sense of the given lexeme.

Notes

One hierarchy of descriptive tags could be

  • Lexical category (goes in Lexeme, this can be confusing)
  • Grammatical category (goes in Form)
  • Syntactic/Phrasal category (not in use now, would be in Sense)
  • Functional category (not in use now, where would this go?)

Another could be to avoid hinting at anything and use "tags" together with their role

  • Lexical tag (goes in Lexeme)
  • Grammatical tag (goes in Form)
  • Syntactic/Phrasal tag (would be in Sense)

… or simply call them "category"/"tags" and leave off the confusing part? It will open up for interpretation, and the users probably want some hints on what goes where.

Need some clarification on what kinds of interlinear gloss are necessary. There are several types. These could be set up as statements, but that will blur the difference between lexical properties and semantic properties. That in turn opens the question of what to call the ordinary properties, as "semantic properties" is very confusing.

Need a field for annotation (description) of a Sense; if the gloss is an interlinear gloss it is not something easily used for a description. An annotation seems to be a multilingual field.

Need some way of handling derivations (inflections). It seems like this is an interlinear gloss on the Form, but this can also be handled as multilevel Senses. The latter will be very confusing for the user to set up and interpret. The present proposal indicates that this should be handled by setting up several Forms.

Often a Sense can be attached to a main form of the lemma held in Form; that is the easy linkage situation. Sometimes it seems like the derived (inflected) forms carry a sense of their own, and then those derived (inflected) forms should connect to specific Senses, or really to glosses belonging to a specific Sense that belongs to this Form. That means Forms and Senses connect, and then derived (inflected) forms and glosses within them connect. Err…

Perhaps the problem is the scope of a Sense and the parent-child relationships…

Lexical statements are sometimes called collocations.[9] This is an easier catchphrase than lexical statements, but perhaps not as easily understandable. It could have some problematic connotations, as it is mostly (?) used for subtle and not-easily-explainable patterns of word usage.[10]

"Term" can create confusion, it is used about both words and phrases.

Proposal

  • Add a separate RDF namespace for "lexical statements", with its own lexical properties (Note that "syntactic statements" isn't quite right, even if it would be fun!)

Lexeme

This adjusts and extends Wikidata:Wiktionary/Development/Proposals/2015-05#Task 3: Lexeme entity type

  • Change "lexical category" to "lexical tags" [resource pick list, multiple entries] (lexical category has a somewhat specific meaning)
  • Change "label" to "lemma" [monolingual, single entry]
  • Add "derivation" to the block for lexical statements [really snak(?)]

Form

This adjusts and extends Wikidata:Wiktionary/Development/Proposals/2015-05#Task 5: Form entity type

  • Change "lexical property" to "grammatical tags" [resource pick list, multiple entries] (lexical property creates confusion as it implies it is one level above)
  • Change "label" to "wordform" [interlinear gloss, single entry]
  • Add "inflection" to the block of lexical statements [really snak, resource pick list, multiple entries] (this is really "containment", it points to another Form)

Sense

This adjusts and extends Wikidata:Wiktionary/Development/Proposals/2015-05#Task 6: Sense entity type

  • Change "gloss" to "annotation" (or simply "description")
  • Add "syntactic tags" to the block for lexical statements [resource pick list, multiple entries] (uses "tags" for consistency)
  • Move "example" to the block for lexical statements [really snak] (now local statements, these are examples that could be in a container)
  • Add "subsense" ("minor sense"?) to the block of lexical statements [really snak] (this is really "containment", it points to another Sense)

Additional lexical properties

These can be added at any entity, but use at Lexeme implies they hold for all Forms and Senses

  • Add "hyponym" (narrower word/phrase) [this is a statement]
  • Add "synonym" (similar word/phrase) [this is a statement]
  • Add "hypernym" (wider word/phrase) [this is a statement]

There are probably more lexical properties.
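
The rule that a lexical property set on the Lexeme also holds for all of its Forms and Senses can be sketched as a simple merge. The function name effective_statements, the (property, value) tuple layout and the example values are assumptions made for illustration only.

```python
def effective_statements(lexeme_statements, own_statements):
    """Combine Lexeme-level statements with those set on one Form or Sense.

    Lexeme-level statements come first; duplicates are dropped.
    """
    combined = list(lexeme_statements)
    for statement in own_statements:
        if statement not in combined:
            combined.append(statement)
    return combined


# Made-up (property, value) pairs, for illustration only.
lexeme_level = [("hypernym", "MOVE")]
sense_level = [("synonym", "SPRINT"), ("hypernym", "MOVE")]

print(effective_statements(lexeme_level, sense_level))
```

With this reading, a statement only needs to be entered on an individual Form or Sense when it does not hold for the whole Lexeme.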

Examples

There are a few pages at Wiktionary that were used as my own examples while writing this page, especially

English
Norwegian

References

  1. Håndbok i Norsk, Kunnskapsforlaget (1995) ISBN 82-573-0562-6
  2. Reiter, Ehud; Dale, Robert; Building Natural Language Generation Systems, Cambridge University Press (2000) ISBN 0-521-62036-8
  3. Jurafsky, Daniel; Martin, James H.; Speech and Language Processing, Pearson Education (2009) ISBN 0-13-504196-1
  4. Speech and Language Processing, pp. 37-38
  5. Speech and Language Processing, p. 391
  6. Speech and Language Processing, p. 157
  7. Speech and Language Processing, p. 120
  8. Speech and Language Processing, p. 422
  9. Firth, J. R.; A synopsis of linguistic theory 1930-1955. In Studies in Linguistic Analysis, pp. 1-32. (1957) "Collocations of a given word are statements of the habitual or customary places of that word."
  10. Manning, Christopher D.; Schütze, Hinrich; Foundations of Statistical Natural Language Processing, p. 141 (1999)