User:SM5POR/Languages

Issues

| Done | Area | Listed | Issue | Question or proposal | Posted | Resolution | Resolved |
| --- | --- | --- | --- | --- | --- | --- | --- |
| | Symbols | 2020-06-06 | TeX string (P1993) has a few "unique value" constraint violations, possibly related to the property descriptions in several languages referring to the "concept" rather than the "symbol" expressed using the TeX string. | Either remove the constraint, or make sure the affected concepts are provided with notation property pointers to the corresponding symbols to make the conflicting properties redundant. | | | |
| | Grammar | 2022-01-28 | Declaring each grammatical category (Q980357), such as case (Q128234), to be a subclass of (P279) grammatical category (Q980357) implies that the former item inherits a number of properties from the latter, including what it is an instance of (P31) (either defined explicitly by a claim for that item, or in turn inherited from its parent class). In effect, case (Q128234) itself becomes a class (a subclass) of grammatical categories, which in reality it isn't (it's a class of grammemes). | The appropriate property to use with grammatical category (Q980357) is instance of (P31), as it breaks the chain of inheritance. | | | |
| | Ontology | 2022-09-25 | When words or phrases from one language or another end up as items in the Wikidata Main namespace (due to Wikipedia articles being written about them, or for other reasons), they should not be confused with the concepts those words refer to. As an example, a curriculum (Q207137) is not a Latin phrase (Q3062294), but the English word "curriculum" is. | Now, is Q90219924 a preposition (Q4833830) in the English language or a relation (Q930933) that may be written in different ways in different languages? Develop queries and methods to identify this kind of conflation, and write guidelines on how to avoid introducing such errors. | | | |
| | Semantics | 2023-01-08 | Senses require a large number of semantic items for interpretation. | Employ qualifiers with item for this sense (P5137) to generate a more diverse effective set of target values. | | | |

Word/subject conflation

Identify anomalies

These items are likely to confuse properties of a subject with the properties of the word for this subject in one or more languages:

As I plan to demonstrate below, adpositions (prepositions, postpositions or circumpositions) without context aren't easily translated between languages, as there is no one-to-one mapping between the set of adpositions in a language and the semantic relations they denote.

in

The following analysis focuses on the English preposition Q90219924:

The item Q90219924 was created in April 2020 and claimed to be an exact match (P2888) of the English and Russian lexemes in (L2987) and в/въ (L2109), respectively. Those (mutual) claims were soon removed (exact match (P2888) is probably not meant to be used with lexemes), and unidirectional item for this sense (P5137) links were left on the lexemes in their place. Other properties, as well as more lexemes, were added later.

However, as almost any preposition typically has numerous different uses within its language, it won't easily map to a single item or translate to a corresponding word in another language. in (L2987) currently lists only two senses, described as "within" and "into" respectively, and they both link to Q90219924, turning that item into (!) a union of two senses (in contrast, the Russian lexeme в/въ (L2109) lists as many as 22 different senses). This is hardly how item for this sense (P5137) is supposed to be used, and in a dictionary a preposition may in reality have dozens of senses.
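The senses currently linked to Q90219924 can be listed with a query along the following lines. This is a minimal sketch for the Wikidata Query Service, assuming its predefined prefixes; the variable names are illustrative, and the English gloss is optional because not every sense has one.

```sparql
# Senses that point to Q90219924 through item for this sense (P5137),
# with their lexeme, lemma, language and English gloss (if any).
SELECT ?lexeme ?lemma ?languageLabel ?sense ?gloss WHERE {
  ?sense wdt:P5137 wd:Q90219924 .
  ?lexeme ontolex:sense ?sense ;
          wikibase:lemma ?lemma ;
          dct:language ?language .
  OPTIONAL {
    ?sense skos:definition ?gloss .
    FILTER(LANG(?gloss) = "en")
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
ORDER BY ?languageLabel ?lemma
```

A lexeme that appears on more than one row has several senses mapped to the same item, which is the "union of senses" situation described above.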

To test this, I composed a few English sentences involving the preposition "in" and added translations into the languages of the linked lexemes. The translations from English were made by Google Translate. I have verified (and corrected) only the German and Swedish translations myself; the Russian translations were verified by User:Infovarius. The Punjabi translations remain unverified.

| English | German | Swedish | Russian | Punjabi | Bengali | Hindi |
| --- | --- | --- | --- | --- | --- | --- |
| in (L2987) | in (L6748) | i (L35761) | в/въ (L2109) | ਵਿਚ/وِچ (L679728) | মধ্যে (L595057) | in (L2987) |
| I don't think we are in Kansas anymore. | Ich glaube nicht, dass wir mehr in Kansas sind. | Jag tror inte att vi är i Kansas längre. | Я не думаю, что мы ещё в Канзасе. | ਮੈਨੂੰ ਨਹੀਂ ਲੱਗਦਾ ਕਿ ਅਸੀਂ ਹੁਣ ਕੰਸਾਸ ਵਿੱਚ ਹਾਂ। | আমি মনে করি না আমরা আর ক্যান্সাসে আছি। | |
| The train will leave Princeton in half an hour. | Der Zug verlässt Princeton in einer halben Stunde. | Tåget kommer att lämna Princeton om en halvtimme. | Поезд отходит из Принстона через полчаса. | ਟ੍ਰੇਨ ਅੱਧੇ ਘੰਟੇ ਵਿੱਚ ਪ੍ਰਿੰਸਟਨ ਤੋਂ ਰਵਾਨਾ ਹੋਵੇਗੀ। | ট্রেনটি আধ ঘন্টার মধ্যে প্রিন্সটন ছেড়ে যাবে। | |
| War and Peace was originally written in Russian. | Krieg und Frieden wurde ursprünglich auf Russisch geschrieben. | Krig och fred skrevs ursprungligen på ryska. | Война и мир изначально была написана на русском языке. | ਜੰਗ ਅਤੇ ਸ਼ਾਂਤੀ ਮੂਲ ਰੂਪ ਵਿੱਚ ਰੂਸੀ ਵਿੱਚ ਲਿਖੀ ਗਈ ਸੀ। | যুদ্ধ ও শান্তি মূলত রুশ ভাষায় লেখা হয়েছিল। | |
| Yuri Gagarin became the first human in space in 1961. | Juri Gagarin flog 1961 als erster Mensch ins All. | Jurij Gagarin blev den första människan i rymden 1961. | Юрий Гагарин стал первым человеком в космосе в 1961 году. | ਯੂਰੀ ਗਾਗਰਿਨ 1961 ਵਿੱਚ ਪੁਲਾੜ ਵਿੱਚ ਜਾਣ ਵਾਲਾ ਪਹਿਲਾ ਮਨੁੱਖ ਬਣਿਆ। | ইউরি গ্যাগারিন সর্বপ্রথম ব্যক্তি যিনি ১৯৬১ সালে মহাকাশ ভ্রমণ করেন। | |
| There are 366 days in a leap year. | Ein Schaltjahr hat 366 Tage. | Det går 366 dagar på ett skottår. | В високосном году 366 дней. | There are 366 days in a leap year. | 'অধিবর্ষে ৩৬৬ দিন'। | 'लीप वर्ष में अधिक दिन'. |

As should be illustrated by the table above, the English preposition "in" seems to correspond fairly well to the Punjabi postposition "ਵਿੱਚ" in its usage in these six different contexts (or senses), but gradually less so to the Russian, German, and Swedish prepositions ("в", "in", and "i" respectively). In Swedish, only the spatial "in" becomes "i", while the other senses are indicated by "på", "om" or simply no word at all.

Class trees

For this reason, I believe lexeme senses should be mapped (using the item for this sense (P5137) property) to different items depending on the exact semantics of those senses in their source language. These items may in turn be linked to each other using the subclass of (P279) property, thereby forming one or more class trees under relation (Q930933) and possibly other concepts. Here is an example:


Example of grammatical relation class tree
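A class tree of this kind can be inspected with a query that walks subclass of (P279) downwards from relation (Q930933). The following is a minimal sketch for the Wikidata Query Service; the row limit is an arbitrary assumption to keep the result manageable, and the tree it returns is whatever is currently modelled in Wikidata, not the proposed one.

```sparql
# Parent/child edges of the subclass tree rooted at relation (Q930933).
SELECT ?class ?classLabel ?parent ?parentLabel WHERE {
  ?parent wdt:P279* wd:Q930933 .   # the root itself or any of its subclasses
  ?class wdt:P279 ?parent .        # direct subclass edge inside the tree
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
ORDER BY ?parentLabel ?classLabel
LIMIT 500
```

Each row is one edge of the tree, so the result can be fed to the query service's graph or tree view.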

Given that we have the lexeme database, I doubt that we also need an item in the Main Wikidata namespace for each lexeme that is specific to one language or another, unless there are entries in other Wikimedia projects requiring such items. In those cases where an item currently serves a double purpose as a word and a sense, and it has never had any Wikimedia links, I would suggest removing the language-specific properties and attributes, resulting in a refined language-independent item describing a single sense only. As one of the aliases for map–territory relation (Q1963130) reads, the word is not the thing!
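Candidates for such a clean-up might be found by looking for items that are declared to be words and are at the same time used as sense targets. The sketch below is only an illustration: the two word classes in the VALUES clause are the ones mentioned on this page (preposition and Latin phrase), and any real survey would need a longer, curated list.

```sparql
# Items declared as words (instance of a word class) that are also the
# target of item for this sense (P5137), with their sitelink count.
SELECT ?item ?itemLabel ?wordClassLabel ?sitelinks
       (COUNT(DISTINCT ?sense) AS ?senses) WHERE {
  VALUES ?wordClass { wd:Q4833830 wd:Q3062294 }   # preposition, Latin phrase
  ?item wdt:P31 ?wordClass ;
        wikibase:sitelinks ?sitelinks .
  ?sense wdt:P5137 ?item .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
GROUP BY ?item ?itemLabel ?wordClassLabel ?sitelinks
ORDER BY ?sitelinks DESC(?senses)
```

Rows with a sitelink count of 0 correspond to items without any current Wikimedia links; the query cannot tell whether an item had such links in the past.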

Grammar

grammatical category (Q980357) grammeme (Q2374489) Number of items Item examples
part of speech (Q82042)

Find grammars

Grammatical categories

The class of grammatical category (Q980357) may well be divided into sub-classes as the need arises, for instance to describe different kinds of grammar, such as those found in the Tamil language.
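To review which items are currently modelled as direct subclasses of grammatical category (Q980357), and whether they are also declared as instances of it, a query like the following may help. It is a minimal sketch for the Wikidata Query Service; the column name ?alsoInstance is an illustrative choice.

```sparql
# Direct subclasses of grammatical category (Q980357), flagged by whether
# the same item is also an instance of (P31) grammatical category.
SELECT ?category ?categoryLabel ?alsoInstance WHERE {
  ?category wdt:P279 wd:Q980357 .
  BIND(EXISTS { ?category wdt:P31 wd:Q980357 } AS ?alsoInstance)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
ORDER BY ?categoryLabel
```

Items such as case (Q128234) showing up here are candidates for the subclass-versus-instance question raised in the issue list; genuine sub-kinds of grammatical category would be expected to remain.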

Grammar Grammatical categories Area of grammar
letter (Q9788)
word (Q8171)
Q20559207
Tamil prosody (Q19576072)
stylistic device (Q182545)

Lexemes

Word classes

Also known as parts of speech.

Reference used below: CODCE9 The Concise Oxford Dictionary of Current English, ninth edition (1995), part of Concise Oxford English Dictionary (Q2992058) series

Adpositions

Including prepositions, postpositions, and circumpositions.

English adpositions

These are mostly prepositions.

a
against
ago (postposition)
as
at
by
ex
for
from
in

CODCE9 identifies 23 different senses (plus 14 as an adverb and 3 as an adjective). See #in discussion below.

into

CODCE9 identifies 5 different senses.

of

CODCE9 identifies 10 different senses.

on
re
to

CODCE9 identifies 15 different senses (plus 2 as an adverb).

under
up
upon
vs

German adpositions

These are mostly prepositions.

a
à
ab
an
in

See #in discussion below.

innerhalb
je
nach
ob
um
zu

Spanish adpositions

These are mostly prepositions.

a
ante
bajo
con
de
en
hacia
hasta
so

Swedish adpositions

These are mostly prepositions.

à
an
av
för
för ... sedan (circumposition)
i

See #in discussion below.

om

See #in discussion below.

på

See #in discussion below.

till
ur
åt
än

Lexeme properties

Find properties for lexemes

Find properties for lexemes

Find properties actually used with lexemes

Find lexemes with a rich set of properties

Find types of properties for which examples of using them on lexemes exist

Find redundant statements on items and their corresponding senses

Recommended property use

Difference between namespaces

Language-independent queries
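```sparql
# For every item used as a sense target, count the languages that have a
# lexeme of a given lexical category pointing at it.
SELECT DISTINCT ?subject ?subjectLabel ?category ?categoryLabel ?languages ?image ?video WHERE {
  {
    SELECT DISTINCT ?subject ?category (COUNT(DISTINCT ?language) AS ?languages) ?image ?video WHERE {
      #VALUES ?subject {wd:Q2}             # uncomment to restrict to a single item
      ?sense wdt:P5137 ?subject.           # item for this sense
      ?lexeme ontolex:sense ?sense.
      ?lexeme wikibase:lexicalCategory ?category.
      ?lexeme dct:language ?language.
      #OPTIONAL {?subject wdt:P18 ?image.} # image; ?image stays unbound while disabled
      #OPTIONAL {?subject wdt:P10 ?video.} # video; ?video stays unbound while disabled
    }
    GROUP BY ?subject ?category ?image ?video
  }
  SERVICE wikibase:label {bd:serviceParam wikibase:language "en".}
}
```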
Language-dependent queries
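```sparql
# For one item (here wd:Q2), list the lexeme forms in each language together
# with pronunciation audio, IPA transcription and any media on the sense.
SELECT DISTINCT ?subject ?language ?speech ?ipa ?writing ?image ?video WHERE {
  VALUES ?subject {wd:Q2}
  ?sense wdt:P5137 ?subject.             # item for this sense
  ?lexeme ontolex:sense ?sense.
  ?lexeme dct:language ?language.
  ?lexeme ontolex:lexicalForm ?form.
  OPTIONAL {?form wdt:P443 ?speech}      # pronunciation audio
  OPTIONAL {?form wdt:P898 ?ipa}         # IPA transcription
  OPTIONAL {?form ?wdtp ?writing}        # ?wdtp is unbound, so this matches any predicate on the form
  OPTIONAL {?sense wdt:P18 ?image.}      # image
  OPTIONAL {?sense wdt:P10 ?video.}      # video
}
```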

Model property proposals

While Wikidata property example for lexemes (P5192) offers suggestions for how to use a specific property in the lexeme domain, demonstrating how to combine multiple properties and other attributes when documenting a word may require a model lexeme, similar to the model item used to show how to design items in the Main entity namespace.

These proposals may be out of date, as there is now at least a model lexeme (P11464) property.

  • Model lexeme
  • Model sense
  • Model form

Statements

| Statement | Model lexeme | Model sense | Model form |
| --- | --- | --- | --- |
| instance of (P31) | Wikidata property (Q18616576) | Wikidata property (Q18616576) | Wikidata property (Q18616576) |
| described at URL (P973) | | | |
| Wikidata item of this property (P1629) | Wikidata model lexeme | Wikidata model sense | Wikidata model form |
| Wikidata usage instructions (P2559) | | | |
| Wikidata property example (P1855) | noun (Q1084) | noun (Q1084) | noun (Q1084) |
| inverse label item (P7087) | | | |
| expected completeness (P2429) | always incomplete (Q21873886) | always incomplete (Q21873886) | always incomplete (Q21873886) |
| related property (P1659) | | | |
| property proposal discussion (P3254) | | | |

Constraints

| Constraint | Model lexeme | Model sense | Model form |
| --- | --- | --- | --- |
| subject type constraint (Q21503250) | class (P2308), relation (P2309) | class (P2308), relation (P2309) | class (P2308), relation (P2309) |
| allowed qualifiers constraint (Q21510851) | property (P2306) | property (P2306) | property (P2306) |
| allowed-entity-types constraint (Q52004125) | item of property constraint (P2305) | item of property constraint (P2305) | item of property constraint (P2305) |
| property scope constraint (Q53869507) | property scope (P5314) | property scope (P5314) | property scope (P5314) |

Lexeme statistics

Note: These statistics seem mostly redundant, as they are less extensive than the statistics gathered by the Wikidata Lexicographical project. I'm retaining this section anyway as a toolbox to be able to compare my numbers with those of the project and verify that I understand the lexeme structural relationships correctly, as well as to conduct some in-depth analysis of specific statistical quantities not described elsewhere.

Number of languages

Find languages with currently at least 10,000 lexemes
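One possible form of such a query is sketched below. It is not necessarily the query originally used on this page: it assumes the Wikidata Query Service and counts every entity typed as ontolex:LexicalEntry per language, keeping only languages at or above the 10,000 threshold.

```sparql
# Languages with at least 10,000 lexemes, largest first.
SELECT ?language ?languageLabel ?lexemes WHERE {
  {
    SELECT ?language (COUNT(?lexeme) AS ?lexemes) WHERE {
      ?lexeme a ontolex:LexicalEntry ;
              dct:language ?language .
    }
    GROUP BY ?language
    HAVING (COUNT(?lexeme) >= 10000)
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
ORDER BY DESC(?lexemes)
```

The aggregation is kept in a subquery so that the label service only has to resolve the few languages that pass the threshold.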

Number of lexemes, senses and forms

Updated 2022-09-04

Language Lexemes Senses Forms
Aragonese (Q8765) 10127 4 29290
Basque (Q8752) 22931 30737 1256971
Bokmål (Q25167) 17525 23346 118708
Czech (Q9056) 14196 5237 715522
Danish (Q9035) 14947 7526 66185
English (Q1860) 71660 28688 130461
Estonian (Q9072) 83208 55 2916037
French (Q150) 13784 8852 86541
German (Q188) 27498 9209 230588
Hebrew (Q9288) 29912 6029 451625
Indonesian (Q9240) 19685 71 412071
Latin (Q397) 32183 556 1198579
Malayalam (Q36236) 63316 11333 749411
Russian (Q7737) 101432 10697 1237781
Slovak (Q9058) 16475 959 235263
Spanish (Q1321) 21056 7042 281386
Swedish (Q9027) 36858 8708 254157
Ukrainian (Q8798) 15967 128 507567
All 909 languages 684223 218317 11171815

Update statistics for previously identified top languages

Update cross-language totals

Number of lexemes per lexical category

Find all lexical categories

Word classes (parts of speech)

Find main categories

Updated 2022-09-18

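| Language | Categories | Words | Nouns | Verbs | Adjectives | Numerals | Interjections | Adverbs | Function words |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Aragonese (Q8765) | | | 6712 | 9 | 3405 | 0 | 0 | 0 | 0 |
| Basque (Q8752) | | | 14495 | 3968 | 277 | 0 | 41 | 21 | 10 |
| Bokmål (Q25167) | | | 11013 | 3406 | 2725 | 0 | 93 | 310 | 194 |
| Czech (Q9056) | | | 4992 | 290 | 4871 | 96 | 13 | 3276 | 194 |
| Danish (Q9035) | | | 8638 | 3546 | 1385 | 69 | 56 | 306 | 216 |
| English (Q1860) | | | 28431 | 7435 | 12506 | 42 | 264 | 20216 | 306 |
| Estonian (Q9072) | | | 60137 | 7932 | 9146 | 176 | 627 | 4436 | 754 |
| French (Q150) | | | 8444 | 1523 | 1765 | 251 | 17 | 573 | 103 |
| German (Q188) | | | 16225 | 3550 | 2710 | 243 | 319 | 2353 | 392 |
| Hebrew (Q9288) | | | 19748 | 4706 | 4269 | 26 | 29 | 107 | 131 |
| Indonesian (Q9240) | | | 6700 | 12782 | 173 | 1 | 1 | 2 | 15 |
| Latin (Q397) | | | 15885 | 6544 | 7307 | 124 | 99 | 1922 | 212 |
| Malayalam (Q36236) | | | 53387 | 3979 | 197 | 134 | 7 | 88 | 109 |
| Russian (Q7737) | | | 101096 | 56 | 60 | 26 | 7 | 20 | 40 |
| Slovak (Q9058) | | | 7037 | 3378 | 4001 | 145 | 56 | 816 | 406 |
| Spanish (Q1321) | | | 12253 | 3815 | 4178 | 0 | 8 | 223 | 89 |
| Swedish (Q9027) | | | 25979 | 4500 | 4007 | 60 | 28 | 908 | 150 |
| Ukrainian (Q8798) | | | 87 | 4 | 15830 | 2 | 0 | 0 | 3 |
| All 921 languages | 205 | 688820 | 436832 | 79222 | 85540 | 2817 | 1823 | 38195 | 4996 |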

Update statistics for previously identified top languages; update cross-language totals

Function words

Find function word categories

Language Categories Function words Conjunctions Adpositions Particles Determiners Pro-forms Interrogative words
Aragonese (Q8765)
All 911 languages

Update statistics for previously identified top languages; update cross-language totals

Morphemes

Find morpheme categories

Language Categories Morphemes Affixes Roots Clitics Confixes
Aragonese (Q8765)
All 911 languages

Find speech recordings for lexemes

Find speech recordings for lexemes

Find related languages

Find languages belonging to a particular family

Map senses to items

Find items linked to senses

Find items covering potentially multiple senses

Expand the effective number of semantic target objects

The property item for this sense (P5137) exists to map each lexeme sense in any language to a single language-independent item identifying the semantic contents of the sense. The number of actual items is however unlikely to ever match the combined diversity of vocabularies from every language, for the following reasons, among others:

  • Due to the way dictionaries and encyclopedias (including Wikipedia) are written, most items correspond to and describe nouns, leaving few options for adjectives or verbs.
  • Even for an item that matches a sense and its part of speech, individual languages may have distinct lexemes for aspects of the item that most languages do not distinguish and that the item therefore does not represent.
  • Some variation in vocabulary may be due to varying language style or level of education of the speaker or the intended audience.

Even when multiple items exist to match variation in a source language when doing a translation, the target language may lack the same nuances with respect to the item, rendering some words untranslatable.

One approach towards solving this problem involves adding qualifiers to the item for this sense (P5137) statement, resulting in an effective number of distinct statement values that is the product of the number of items and the total number of qualifier value combinations. Since item for this sense (P5137) typically links numerous languages and senses to the same item, the variation can be expected to appear on the lexeme/sense subject side of the statement, suggesting subject has role (P2868) as a suitable qualifier. Multiple aspects may be represented using different sets of qualifier value items:

  • Level of understanding
    • child level
    • general level (also default)
    • academic level
  • Sociolinguistic context
    • slang
    • popular
    • professional
    • spiritual
  • Language style
    • casual
    • factual
    • formal
    • poetic
  • Grammatical context
    • possessive action
    • production
    • consumption
    • bringing
    • removing
    • sounding
    • has quality like
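Existing uses of this approach, if any, can be looked up by going through the full statement node rather than the direct wdt: shortcut, since qualifiers are attached to the statement. The query below is a minimal sketch for the Wikidata Query Service; it may well return few or no rows if subject has role (P2868) is not yet used as a qualifier on item for this sense (P5137).

```sparql
# item for this sense (P5137) statements qualified with
# subject has role (P2868), listed per lexeme and sense.
SELECT ?lexeme ?lemma ?sense ?item ?itemLabel ?role ?roleLabel WHERE {
  ?lexeme ontolex:sense ?sense ;
          wikibase:lemma ?lemma .
  ?sense p:P5137 ?statement .          # full statement node
  ?statement ps:P5137 ?item ;          # main value: the target item
             pq:P2868 ?role .          # qualifier: role of the sense
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 100
```

Each distinct pair of ?item and ?role in the result is one effective target value in the sense described above.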

Instance of term considered harmful

Find instances of term that are probably conflations

Find homographs

Find declared homographs within each language

Find languages telling different kinds of events apart

Find words referring to "events"

Find senses in English, German and Swedish

Find senses in English, German and Swedish

Find lexemes

Find lexemes of particular languages

Find lexemes of particular lexical categories

Not working yet

Broken query

Find subclasses of a given set of classes, listing additional clues for those without English labels

Not working yet

Broken query

Labels

Wikidata label statistics

Property labels in most languages per language family

Proper names

Compare number of unique names used across multiple languages

Compare number of unique names used across multiple languages

Finding classes of items other items are named after

Find classes of items other items are named after

Translation

Phonetics

Synthetic speech

Visual language

Symbols

Finding concepts with corresponding symbols sharing the same notational property

Find concepts with corresponding symbols sharing the same notational property

Typography

List usage of typeface/font used (P2739)

Writing systems

Finding ontological relations between writing systems, scripts, alphabets, and letters

Find ontological relations between writing systems, scripts, alphabets, and letters

Mongolian script