Wikidata:Lexicographical data/Documentation/Languages/hi

natural language, modern language, common language
Subclass of	Western Hindi
Native label	ہندوستانی, हिन्दुस्तानी
Indigenous to	Pakistan, Delhi
Linguistic typology	subject–object–verb, syllabic language, fusional language
Has grammatical case	obliquus in Hindi
Has grammatical gender	feminine, masculine
Writing system	Devanagari, Urdu orthography
Language regulatory body	Central Hindi Directorate
Entry in abbreviations table	H., ਹਿੰ., ہ, ҳ.

Hindustani (Q11051), or Hindi-Urdu, is a language spoken in India and Pakistan. This is the documentation page for the Hindustani (Hindi-Urdu) language under WikiProject Wikidata:Lexicographical data. It is intended for coordinating contributions to Hindustani (Hindi and Urdu) lexeme content and related discussions. WikiProject India is a related WikiProject that covers all Hindustani topics.

Example Hindustani lexeme entries:

Sample Lexemes by Lexical Category
verb	noun	pronoun	adjective	adverb	postposition	conjunction	interjection	determiner	grammatical particle
आना/آنا (L33485)	चूल्हा/چُولھا (L1011246)	तुम/تُم (L580418)	बहरा/بہرا (L640865)	आगे/آگے (L580431)	तक/تک (L409543)	लेकिन/لیکِن (L580024)	नमस्ते/نمستے (L579679)	सब/سب (L620518)	भी/بھی (L580358)

Wikidata:Lexemes aims to provide CC0-licensed structured lexicographical data that anyone can use for any purpose, including Wiktionary and the upcoming Abstract Wikipedia.

Layout

Every lexeme entry has the following layout:

Lexeme-level

The lemma of the lexeme can be considered a title or headword, generally the dictionary form of the word. It is to be written in both hi (Hindi, Devanagari script) and ur (Urdu, Arabic script) spelling variants for the Hindustani language entries. See उठना/اُٹھنا (L1071943) for example.

Every lexeme entry will have a lexeme ID (beginning with "L").

The language of the lexeme should be Hindustani (Q11051) in all cases (that is, not Hindi (Q1568) and not Urdu (Q1617)).

The lexical category should also be specified as broadly as possible, based on the Hindustani linguistic ontology.
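
The lexeme-level rules above can be sketched as a small check. This is a minimal Python illustration, assuming a simplified dictionary shaped loosely like Wikibase lexeme JSON (`lemmas`, `language`, `lexicalCategory`). The function name `check_lexeme_header` is hypothetical and not part of any Wikidata tool.

```python
HINDUSTANI = "Q11051"             # language item required by this page
REQUIRED_VARIANTS = ("hi", "ur")  # Devanagari and Arabic-script spellings


def check_lexeme_header(lexeme):
    """Return a list of problems found in a simplified lexeme dict."""
    problems = []
    lemmas = lexeme.get("lemmas", {})
    for code in REQUIRED_VARIANTS:
        if not lemmas.get(code, {}).get("value"):
            problems.append(f"missing lemma for spelling variant '{code}'")
    if lexeme.get("language") != HINDUSTANI:
        problems.append("language should be Q11051 (Hindustani)")
    if not lexeme.get("lexicalCategory"):
        problems.append("lexical category is not set")
    return problems


# Example modelled on चूल्हा/چُولھا (L1011246), listed above as a noun (Q1084).
example = {
    "lemmas": {
        "hi": {"language": "hi", "value": "चूल्हा"},
        "ur": {"language": "ur", "value": "چُولھا"},
    },
    "language": "Q11051",
    "lexicalCategory": "Q1084",
}
print(check_lexeme_header(example))  # []
```

A lexeme entered with only a Hindi lemma, or with Hindi (Q1568) as its language, would come back with one problem string per rule it breaks.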

Senses

Senses represent different meanings of the same word.

Statements that may be added to senses include image, item for this sense, translation, synonym, antonym, and usage example, among others (see list). Note that the translation, antonym, and synonym properties take the sense ID of the target lexeme (for example LXXXXX-S1), not its lexeme ID, so the sense ID is what has to be copied and pasted.
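
The difference between a sense ID and a lexeme ID can be checked mechanically before a value is pasted. The sketch below assumes the ID shapes described above (`L` plus digits for a lexeme, with `-S` plus digits appended for a sense). `classify_target` is a hypothetical helper name.

```python
import re

SENSE_ID = re.compile(r"L[1-9]\d*-S[1-9]\d*")   # e.g. L1071943-S1
LEXEME_ID = re.compile(r"L[1-9]\d*")            # e.g. L1071943


def classify_target(value):
    """Say whether a pasted value looks like a sense ID, a lexeme ID, or neither."""
    value = value.strip()
    if SENSE_ID.fullmatch(value):
        return "sense"
    if LEXEME_ID.fullmatch(value):
        return "lexeme"
    return "invalid"


print(classify_target("L1071943-S1"))  # sense
print(classify_target("L1071943"))     # lexeme
print(classify_target("Q11051"))       # invalid
```

Only values classified as `sense` are suitable targets for the translation, synonym, and antonym properties.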

Forms

Forms represent different inflections (cases for nouns/adjectives, conjugations for verbs) of the lexeme (in both Hindi and Urdu spelling variants).

Each noun typically has four forms, one for each combination of number (singular (Q110786)/plural (Q146786)) and case (direct case (Q1751855)/oblique case (Q1233197)). A small number of nouns, often but not always animate, also have vocative inflections (vocative case (Q185077)). These are governed by the senses on the lexeme and should not be added without certainty that they are used.
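
The four expected noun forms are simply the cross product of number and case. The sketch below enumerates them with the item IDs given above. The helper name `expected_feature_sets` is hypothetical, and generating the vocative in both numbers is an assumption made for illustration, so it is switched off by default.

```python
from itertools import product

NUMBERS = {"singular": "Q110786", "plural": "Q146786"}
CASES = {"direct": "Q1751855", "oblique": "Q1233197"}
VOCATIVE = ("vocative", "Q185077")


def expected_feature_sets(include_vocative=False):
    """List the grammatical-feature combinations a noun's forms should cover."""
    cases = dict(CASES)
    if include_vocative:
        # Only for nouns whose vocative is known to be in use.
        cases[VOCATIVE[0]] = VOCATIVE[1]
    return [
        {"labels": (number, case), "features": [NUMBERS[number], cases[case]]}
        for number, case in product(NUMBERS, cases)
    ]


for entry in expected_feature_sets():
    print(entry["labels"], entry["features"])
```

Comparing this list with the feature sets already present on a lexeme shows which forms are still missing.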

Structure and properties

Common properties to be added for lexeme entries are given below:

Statements

Identifiers

Urdu Lughat ID (P11350) – aggregate online dictionary maintained by the Urdu Dictionary Board, a Karachi-based Pakistani government operation

Provided below is a key to some of the part of speech abbreviations used in the headings of entries. A key to those used in the footer for etymologies may be found in the menu on the Advanced Search page.

صف = صفت (adjective)
امث = اسمِ مؤنث (feminine noun)
امذ = اسمِ مذکر (masculine noun)
ف ل = فعل لازمی (intransitive verb)
ف م = فعل متعدی (transitive verb)
م ف = متعلق فعل (adverb)

Senses

See sense properties by usage

Forms

See forms by grammatical feature

Grammatical features
- Grammatical gender: masculine (Q499327) / feminine (Q1775415)
- Grammatical number: singular (Q110786) / plural (Q146786)
pronunciation audio (P443)
IPA transcription (P898)

Spelling

Below are some guidelines for resolving some irregularities in spellings between the two writing systems, particularly for words which may be poorly attested in one register or the other.

ष — in Sanskritized words borrowed via Bengali this is ش, otherwise it is کھ. Most words spelled with this letter post-partition are Bengali borrowings.
ज्ञ — in practice always گی. Some Urdu dictionaries contain spellings with نج under the assumption this cluster represents an independent sound in Hindi, but this does not reflect actual usage.
ऋ — is always رِ.
ण — is always ن.
पुर — word-finally, this is پور rather than پُر.
ऑ — this vowel is purely decorative and is best ignored even in Devanagari spellings. Most of its use is confined to distinguishing the abbreviation डाॅ॰ “Dr.”.
आँड़ — this sequence of a nasal and flap is typically written as نڈ in Urdu dictionaries, and it is acceptable to pair these spellings together because the consonants represented by ڑ and ڈ are allophones in native Hindustani words. In English loanwords, only ڈ is realized in all positions, and in vocabulary loaned from Punjabi the positions of ڈ and ڑ are maintained in Urdu spellings, as these sounds are not allophones in Punjabi.
त् — words spelled with this ending in standard Hindi are borrowings from Bengali words ending in ৎ. Although the virama/halant is retained when not followed by a suffix, it is removed in the oblique plural as in तों rather than त्ओं.
आँव — although انو may be found for this sequence in older Urdu writing, this is now more commonly written as اؤں.
य — word-finally, spelled with یہ in borrowings from Bengali, otherwise spelled with ے.
ژ — the value of this letter is always simply ज.
ہ — the use of this letter word-finally is often arbitrary and unetymological. The word commonly spelled پتہ in Urdu is from Punjabi پتا rather than a Persian *پته. If variants with both this letter and ا exist, they do not need separate lexemes. The lemma can follow whichever spelling is treated as the primary one in Urdu Lughat.
ق — some of the words spelled with this letter are native words which have been given pseudo-Arabic spellings, such as قُلی. The nukta form क़ is not necessary to represent this consonant which already had an ambiguous status in Persian. The /q/ phoneme represented by ق does not have phonemic status in Pashto either, and the spelling in Pashto onomatopoeic formations used in Hindustani like تڑق is an emphatic affect.
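
A few of the guidelines above are unconditional and could be kept in a lookup table when checking a lemma pair by hand. The sketch below encodes only the rules stated as "always", plus the word-final पुर rule, and deliberately leaves out every context-dependent case. The names `FIXED_SPELLINGS`, `fixed_urdu_spelling`, and `final_pur_spelling` are hypothetical.

```python
# Devanagari unit -> Urdu spelling, only for rules stated above as unconditional.
FIXED_SPELLINGS = {
    "ज्ञ": "گی",
    "ऋ": "رِ",
    "ण": "ن",
}


def fixed_urdu_spelling(devanagari_unit):
    """Return the fixed Urdu spelling for a unit, or None if the rule depends on context."""
    return FIXED_SPELLINGS.get(devanagari_unit)


def final_pur_spelling(word_hi):
    """Return the Urdu ending for a word-final पुर, or None if the word does not end in it."""
    return "پور" if word_hi.endswith("पुर") else None


print(fixed_urdu_spelling("ण"))
print(fixed_urdu_spelling("ष"))       # None: depends on the borrowing route
print(final_pur_spelling("कानपुर"))
```

Units such as ष or word-final य return `None` here because their spelling depends on how the word was borrowed, which a table cannot decide.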

Maintenance

Recent Changes to Hindustani Lexemes

To do

Add the most frequent missing Hindustani forms to Wikidata's lexicographical data.

Lexicographical Coverage

See also: WD:Lexicographical data/Statistics

The lexeme form coverage figures for the Hindustani language are given below.

Forms in Wikidata: 7,686
Forms in Wikipedia: 54,443
Tokens: 18,734,831
Covered forms: 3,084 (5.7%)
Missing forms: 51,359 (94.3%)
Covered tokens: 12,448,894 (66.4%)
Missing tokens: 6,285,937 (33.6%)
Most frequent missing forms
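
The percentages above follow directly from the raw counts. The short calculation below reproduces them, assuming each share is rounded to one decimal place. `share` is a hypothetical helper name.

```python
def share(part, total):
    """Percentage of total, rounded to one decimal place."""
    return round(100 * part / total, 1)


forms_in_wikipedia = 54_443
covered_forms = 3_084
tokens = 18_734_831
covered_tokens = 12_448_894

missing_forms = forms_in_wikipedia - covered_forms   # 51,359
missing_tokens = tokens - covered_tokens             # 6,285,937

print(share(covered_forms, forms_in_wikipedia))   # 5.7
print(share(missing_forms, forms_in_wikipedia))   # 94.3
print(share(covered_tokens, tokens))              # 66.4
print(share(missing_tokens, tokens))              # 33.6
```

The gap between 5.7% of forms and 66.4% of tokens indicates that the forms already covered are the frequent ones, so each newly added high-frequency form moves token coverage far more than form coverage.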


Queries

Main page: WD:Lexicographical data/Ideas of queries

Hindustani Q-id: Q11051

1) Get all existing lexemes in Hindustani: query result

The following query uses these:

Items: Hindustani (Q11051)  
```
SELECT ?lexeme ?lemma WHERE {
  ?lexeme dct:language wd:Q11051; 
          wikibase:lemma ?lemma.
}
```
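
The same query can also be sent to the Wikidata Query Service from a script. The sketch below uses only the Python standard library and assumes the public endpoint `https://query.wikidata.org/sparql` with JSON output. The helper names and the User-Agent string are hypothetical, and `fetch_rows` needs network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?lexeme ?lemma WHERE {
  ?lexeme dct:language wd:Q11051;
          wikibase:lemma ?lemma.
}
"""


def build_url(query, endpoint=ENDPOINT):
    """Build a GET URL that asks the endpoint for JSON results."""
    return endpoint + "?" + urlencode({"query": query, "format": "json"})


def fetch_rows(query, user_agent="hindustani-lexeme-example/0.1"):
    """Run the query and return the list of result bindings (requires network)."""
    request = Request(build_url(query), headers={"User-Agent": user_agent})
    with urlopen(request) as response:
        data = json.load(response)
    return data["results"]["bindings"]
```

Calling `fetch_rows(QUERY)` returns one dictionary per lexeme, with `lexeme` and `lemma` keys in the standard SPARQL JSON results layout.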


2) Get the count of lexemes in Hindustani belonging to different lexical categories: https://w.wiki/3$cf

3) Query for all Hindi/Urdu nouns missing a direct case: query

The following query uses these:

Items: Hindustani (Q11051)   , noun (Q1084)   , direct case (Q1751855)  
```
SELECT DISTINCT ?l ?lemma WHERE {
  # Hindustani nouns that have at least one form with a representation
  ?l a ontolex:LexicalEntry ;
     dct:language wd:Q11051 ;
     wikibase:lexicalCategory wd:Q1084 ;
     wikibase:lemma ?lemma ;
     ontolex:lexicalForm ?form .
  ?form ontolex:representation ?word .
  # Drop every lexeme that already has a form tagged direct case (Q1751855)
  MINUS {
    ?l a ontolex:LexicalEntry ;
       ontolex:lexicalForm/wikibase:grammaticalFeature wd:Q1751855 .
  }
}
```


Resources

Some resources, in addition to the ones listed below, may be found at Commons:Category:Books about the Hindustani language.

Dictionaries

Quotable dictionaries

Public domain dictionaries may be quoted using gloss quote (P8394), referenced with the claims stated in (P248) (appropriate dictionary item), page(s) (P304) (appropriate page number), and reference URL (P854) if applicable.
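
The referencing pattern described above can be pictured as a small data structure. The sketch below is a simplified illustration, not the actual Wikibase JSON serialization. It assumes gloss quote (P8394) holds a text with a language code, and the helper names, the page number, and the quote text are hypothetical placeholders.

```python
def dictionary_reference(dictionary_qid, pages, url=None):
    """Reference block: stated in (P248), page(s) (P304), optional reference URL (P854)."""
    reference = {"P248": dictionary_qid, "P304": pages}
    if url:
        reference["P854"] = url
    return reference


def gloss_quote_statement(quote, language_code, reference):
    """Gloss quote (P8394) statement carrying one dictionary reference."""
    return {
        "property": "P8394",
        "value": {"text": quote, "language": language_code},
        "references": [reference],
    }


# Placeholder values: the page number and quote are illustrative, not real citations.
ref = dictionary_reference("Q108916279", "123")
statement = gloss_quote_statement("(quoted gloss goes here)", "en", ref)
print(statement)
```

The reference URL is added only when one applies, which mirrors the "if applicable" wording above.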

Public domain monolingual dictionaries (preferred):

Public domain bilingual dictionaries (those in other regional languages preferred):

Urdu-Punjabi-Hindi dictionary (Q116459885)
Kangri Shabd Sangraha (Q116222955)
Masdar-e Fuyuz (Q117189077)
Hindi Punjabi Kosh (Q117189099)
A Dictionary of Urdu, Classical Hindi, and English (Q108916279)
Brice's Romanized Hindústánî and English dictionary
Fallon's new Hindustani-English dictionary (searchable at https://dsal.uchicago.edu/dictionaries/fallon/)
Forbes's dictionary, Hindustani and English^[1]
Shakespear's dictionary, Hindūstānī and English^[1] (reprinted in Lahore in 1980 as Dictionary, Urdu-English and English-Urdu; searchable at https://dsal.uchicago.edu/dictionaries/shakespear/)
Yates' dictionary, Hindustání and English
Q84072461

More may be found here.

Citable, but non-quotable, dictionaries

Other dictionaries that may be cited but not quoted include the following (those glossed in regional languages are likewise preferable):

Hindko Urdu Lughat (Q115437685)
Sindhi-Urdu lughat (Q116740442)
Burushaski-Urdu Lughat (Q115929776)
Urdu Punjabi Lughat (Q65398900)
Pehli Waddi Saraiki Lughat (Q113960284)
Bahri's Learners' Hindi-English dictionary
Caturvedi's practical Hindi-English dictionary
Qureshi's Kitabistan's 20th century standard dictionary
Hindi-Chinese Kosh (Q113530710)
Tulsi Shabdsagar ([1] [2])
Manak Hindi Kosh (vol. 1, vol. 2, vol. 3, vol. 4, vol. 5)
Lughat-e Firozi (part)
Brajbhasha Sur-kosh (vol. 1, vol. 2)

Phrases

Grammars

Orthography

Urdu imla (Q115780460) – a comprehensive public domain work explicating the history of Urdu orthography

Regional Context

Tools

WD:Tools/Lexicographical data

Ordia
MachtSinn
Bodh - to add statements to lexemes, senses and forms.

Contact

Wikidata talk:WikiProject India can provide help with Hindi- and Urdu-related questions
User:Vis M

[1] "In Hindi we have Shakespear and Forbes, but neither of these works is more than a very copious vocabulary, and both are derived almost exclusively from the written language."
