Wikidata:Lexicographical data/Documentation/Languages/hi

Hindustani
natural language, modern language, common language
  • Subclass of: Western Hindi
  • Native label: ہندوستانی, हिन्दुस्तानी
  • Indigenous to: Pakistan, Delhi
  • Linguistic typology: subject–object–verb, syllabic language, fusional language
  • Has grammatical case: obliquus in Hindi
  • Has grammatical gender: feminine, masculine
  • Writing system: Devanagari, Urdu orthography
  • Language regulatory body: Central Hindi Directorate
  • Entry in abbreviations table: H., ਹਿੰ., ہ, ҳ.

Hindustani (Q11051), or Hindi-Urdu, is a language spoken in India and Pakistan. This is the documentation page for the Hindustani (Hindi-Urdu) language under WikiProject Wikidata:Lexicographical data, intended for coordinating contributions to Hindustani (Hindi and Urdu) lexeme content and related discussions. WikiProject India is a related WikiProject that covers all Hindustani topics.

Example Hindustani lexeme entries:

Sample Lexemes by Lexical Category
  • verb: आना/آنا (L33485)
  • noun: चूल्हा/چُولھا (L1011246)
  • pronoun: तुम/تُم (L580418)
  • adjective: बहरा/بہرا (L640865)
  • adverb: आगे/آگے (L580431)
  • postposition: तक/تک (L409543)
  • conjunction: लेकिन/لیکِن (L580024)
  • interjection: नमस्ते/نمستے (L579679)
  • determiner: सब/سب (L620518)
  • grammatical particle: भी/بھی (L580358)

Wikidata:Lexemes aims to provide CC0-licensed structured lexicographical data for everyone to use for different purposes, including Wiktionary and the upcoming Abstract Wikipedia.


Layout

Every lexeme entry has the following layout:

Lexeme-level

The lemma of the lexeme can be considered a title or headword, generally the dictionary form of the word. It is to be written in both hi (Hindi, Devanagari script) and ur (Urdu, Arabic script) spelling variants for the Hindustani language entries. See उठना/اُٹھنا (L1071943) for example.

Every lexeme entry will have a lexeme ID (beginning with "L").

The language of the lexeme should be Hindustani (Q11051) in all cases (that is, not Hindi (Q1568) and not Urdu (Q1617)).

The lexical category should also be specified as broadly as possible, based on the Hindustani linguistic ontology.
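
The lexeme-level fields described above can be sketched as a small Python structure. This is only an illustration that loosely follows the shape of Wikibase lexeme JSON; the helper name `build_lexeme_stub` is hypothetical, and Q24905 is assumed here to be the item for "verb".

```python
HINDUSTANI = "Q11051"  # language item used for every Hindustani entry


def build_lexeme_stub(hi_lemma, ur_lemma, lexical_category):
    """Return a dict holding both spelling variants of one lemma."""
    if not hi_lemma or not ur_lemma:
        raise ValueError("both the hi and ur spelling variants are required")
    return {
        "type": "lexeme",
        "language": HINDUSTANI,
        "lexicalCategory": lexical_category,
        "lemmas": {
            "hi": {"language": "hi", "value": hi_lemma},
            "ur": {"language": "ur", "value": ur_lemma},
        },
    }


# Lemma taken from the example entry L1071943; Q24905 is an assumed category item.
stub = build_lexeme_stub("उठना", "اُٹھنا", "Q24905")
print(sorted(stub["lemmas"]))
```

Rejecting an empty variant up front mirrors the rule that every lemma carries both the hi and the ur spelling.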

Senses

Senses represent different meanings of the same word.

Some statements that may be added to senses include image, item for this sense, translation, synonym, antonym, usage example, and more (see list). Note that for the translation, antonym, and synonym properties, the sense ID (LXXXXX-S1) of the target lexeme has to be copied and pasted, not the lexeme ID.
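
A quick format check can catch the common mistake of pasting a lexeme ID where a sense ID is expected. The sketch below assumes that sense IDs always follow the `L<number>-S<number>` pattern shown above; the function name `is_sense_id` is hypothetical, and the check says nothing about whether the sense actually exists.

```python
import re

# Assumed shapes: "L33485" is a lexeme ID, "L33485-S1" is a sense ID.
SENSE_ID = re.compile(r"L[1-9]\d*-S[1-9]\d*")
LEXEME_ID = re.compile(r"L[1-9]\d*")


def is_sense_id(value):
    """True only when the whole string looks like a sense ID."""
    return SENSE_ID.fullmatch(value) is not None


def is_lexeme_id(value):
    """True only when the whole string looks like a bare lexeme ID."""
    return LEXEME_ID.fullmatch(value) is not None


for candidate in ("L33485-S1", "L33485"):
    print(candidate, is_sense_id(candidate))
```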

Forms

Forms represent different inflections (cases for nouns/adjectives, conjugations for verbs) of the lexeme (in both Hindi and Urdu spelling variants).

Each noun typically has four forms, for each combination of number (singular (Q110786)/plural (Q146786)) and case (direct case (Q1751855)/oblique case (Q1233197)). A small number of nouns which are often but not always animate also have vocative inflections (vocative case (Q185077)). These are governed by the senses on the lexeme and should not be added without certainty that they are used.
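
The four typical noun forms are simply every pairing of number and case. As a rough sketch (the function name `expected_noun_feature_sets` is hypothetical, and vocative forms are deliberately left out because they should only be added when attested):

```python
from itertools import product

# Grammatical feature items named in the paragraph above.
NUMBERS = {"singular": "Q110786", "plural": "Q146786"}
CASES = {"direct": "Q1751855", "oblique": "Q1233197"}


def expected_noun_feature_sets():
    """List the number/case combinations a typical noun is expected to have."""
    return [
        {"label": f"{case} {number}", "features": [case_qid, number_qid]}
        for (number, number_qid), (case, case_qid) in product(
            NUMBERS.items(), CASES.items()
        )
    ]


for slot in expected_noun_feature_sets():
    print(slot["label"], slot["features"])
```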

Structure and properties

Common properties to be added for lexeme entries are given below:

Statements

Identifiers

  • Urdu Lughat ID (P11350) – aggregate online dictionary maintained by the Urdu Dictionary Board, a Karachi-based Pakistani government operation
Provided below is a key to some of the part of speech abbreviations used in the headings of entries. A key to those used in the footer for etymologies may be found in the menu on the Advanced Search page.
  • صف = صفت
  • امث = اسمِ مؤنث
  • امذ = اسمِ مذکر
  • ف ل = فعل لازمی
  • ف م = فعل متعدی
  • م ف = متعلق فعل

Senses

See sense properties by usage

Forms

See forms by grammatical feature

Spelling

Below are some guidelines for resolving some irregularities in spellings between the two writing systems, particularly for words which may be poorly attested in one register or the other.

  • ष — in Sanskritized words borrowed via Bengali this is ش, otherwise it is کھ. Most words spelled with this letter post-partition are Bengali borrowings.
  • ज्ञ — in practice always گی. Some Urdu dictionaries contain spellings with نج under the assumption this cluster represents an independent sound in Hindi, but this does not reflect actual usage.
  • ऋ — is always رِ.
  • ण — is always ن.
  • पुर — word-finally, this is پور rather than پُر.
  • ऑ — this vowel is purely decorative and is best ignored even in Devanagari spellings. Most of its use is confined to distinguishing the abbreviation डाॅ॰ “Dr.”.
  • आँड़ — this sequence of a nasal and flap is typically written as نڈ in Urdu dictionaries, and it is acceptable to pair these spellings together as the consonants represented by ڑ and ڈ are allophones in native Hindustani words. In English loanwords, only ڈ is realized in all positions, and in vocabulary loaned from Punjabi the positions of ڈ and ڑ are maintained in Urdu spellings as these sounds are not allophones in Punjabi.
  • त् — words spelled with this ending in standard Hindi are borrowings from Bengali words ending in ৎ. Although the virama/halant is retained when not followed by a suffix, it is removed in the oblique plural as in तों rather than त्ओं.
  • आँव — although انو may be found for this sequence in older Urdu writing, this is now more commonly written as اؤں.
  • य — word-finally, spelled with یہ in borrowings from Bengali, otherwise spelled with ے.
  • ژ — the value of this letter is always simply ज.
  • ہ — the use of this letter word finally is often arbitrary and unetymological. The word commonly spelled پتہ in Urdu is from Punjabi پتا rather than a Persian *پته. If both variants with this letter and ا exist they do not need separate lexemes. The lemma can follow whichever spelling is treated as the primary one in Urdu Lughat.
  • ق — some of the words spelled with this letter are native words which have been given pseudo-Arabic spellings, such as قُلی. The nukta form क़ is not necessary to represent this consonant which already had an ambiguous status in Persian. The /q/ phoneme represented by ق does not have phonemic status in Pashto either, and the spelling in Pashto onomatopoeic formations used in Hindustani like تڑق is an emphatic affect.
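
Some of the guidelines above are fixed correspondences, while others depend on where a word was borrowed from. The following sketch encodes a few of them as an illustration only: the function names are hypothetical, the borrowing source must be supplied by the editor, and the code is not a transliteration tool.

```python
# Correspondences the guidelines describe as fixed ("always").
ALWAYS_RULES = {
    "ज्ञ": "گی",
    "ऋ": "رِ",
    "ण": "ن",
}


def fixed_urdu_spelling(devanagari_unit):
    """Return the Urdu spelling for a unit with a fixed rule, else None."""
    return ALWAYS_RULES.get(devanagari_unit)


def urdu_for_sha(borrowed_via_bengali):
    """ष: ش in Sanskritized words borrowed via Bengali, otherwise کھ."""
    return "ش" if borrowed_via_bengali else "کھ"


def urdu_for_final_ya(borrowed_from_bengali):
    """Word-final य: یہ in borrowings from Bengali, otherwise ے."""
    return "یہ" if borrowed_from_bengali else "ے"


print(fixed_urdu_spelling("ण"), urdu_for_sha(True), urdu_for_final_ya(False))
```

Keeping the context-dependent rules as separate functions makes the extra input they need explicit, since the spelling cannot be derived from the Devanagari letter alone.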

Maintenance

To do

Lexicographical Coverage

See also: WD:Lexicographical data/Statistics
  • The lexeme forms coverage for the Hindustani language is given below.
  • Forms in Wikidata: 7,630
  • Forms in Wikipedia: 54,443
  • Tokens: 18,734,831
  • Covered forms: 3,059 (5.6%)
  • Missing forms: 51,384 (94.4%)
  • Covered tokens: 12,434,459 (66.4%)
  • Missing tokens: 6,300,372 (33.6%)
  • Most frequent missing forms
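
The percentages above follow directly from the raw counts. A minimal sketch of the arithmetic, using the figures listed here (the function name `coverage` is hypothetical):

```python
def coverage(covered, total):
    """Share of `total` that is covered, as a percentage with one decimal."""
    return round(100 * covered / total, 1)


forms_total = 54_443        # distinct forms found in Wikipedia
forms_covered = 3_059       # of those, forms that exist in Wikidata
tokens_total = 18_734_831
tokens_covered = 12_434_459

print("forms:", coverage(forms_covered, forms_total), "%")
print("tokens:", coverage(tokens_covered, tokens_total), "%")
print("missing forms:", forms_total - forms_covered)
print("missing tokens:", tokens_total - tokens_covered)
```

Token coverage is much higher than form coverage because the forms already present tend to be the most frequent ones.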

Queries

Main page: WD:Lexicographical data/Ideas of queries

1) Get all existing lexemes in Hindustani: query result

The following query uses these:

  • Items: Hindustani (Q11051)
    SELECT ?lexeme ?lemma WHERE {
      ?lexeme dct:language wd:Q11051;
              wikibase:lemma ?lemma.
    }
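
To run the same query from a script, it can be sent to the Wikidata Query Service as a URL parameter. The sketch below only builds the request URL and does not contact the network; it assumes the public endpoint at https://query.wikidata.org/sparql, and the function name `build_request_url` is hypothetical.

```python
from urllib.parse import urlencode

ENDPOINT = "https://query.wikidata.org/sparql"  # assumed public endpoint

QUERY = """
SELECT ?lexeme ?lemma WHERE {
  ?lexeme dct:language wd:Q11051;
          wikibase:lemma ?lemma.
}
"""


def build_request_url(query, endpoint=ENDPOINT):
    """Return a GET URL that asks the endpoint for JSON results."""
    return endpoint + "?" + urlencode({"query": query.strip(), "format": "json"})


url = build_request_url(QUERY)
print(url)
```

Fetching the URL with an HTTP client of your choice should return the bindings for `?lexeme` and `?lemma`; sending a descriptive User-Agent header with such requests is generally expected.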
    

2) Get the count of lexemes in Hindustani belonging to different lexical categories: https://w.wiki/3$cf

3) Query for all Hindi/Urdu nouns missing a direct case: query

The following query uses these:

  • Items: Hindustani (Q11051), noun (Q1084), direct case (Q1751855)
    SELECT DISTINCT ?l ?lemma WHERE {
      # Hindustani nouns that have at least one form
      ?l a ontolex:LexicalEntry ;
         dct:language wd:Q11051 ;
         wikibase:lexicalCategory wd:Q1084 ;
         wikibase:lemma ?lemma ;
         ontolex:lexicalForm ?form .
      ?form ontolex:representation ?word .
      # Drop every lexeme that has any form marked with the direct case
      MINUS {
        ?l ontolex:lexicalForm/wikibase:grammaticalFeature wd:Q1751855 .
      }
    }
    

Resources

Some resources, in addition to the ones listed below, may be found at Commons:Category:Books about the Hindustani language.

Dictionaries

Quotable dictionaries

Public domain dictionaries may be quoted using gloss quote (P8394), referenced with the claims stated in (P248) (appropriate dictionary item), page(s) (P304) (appropriate page number), and reference URL (P854) if applicable.
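
The pieces of such a quotation can be pictured as one claim with one reference attached. The following is a simplified sketch, not the actual Wikibase JSON serialization; the function name `gloss_quote_claim`, the placeholder item ID, and the sample text are all hypothetical.

```python
def gloss_quote_claim(quote, dictionary_item, pages, url=None):
    """Bundle a gloss quote (P8394) with its stated-in reference."""
    reference = {
        "P248": dictionary_item,  # stated in: item for the dictionary
        "P304": pages,            # page(s)
    }
    if url is not None:
        reference["P854"] = url   # reference URL, only if applicable
    return {"property": "P8394", "value": quote, "references": [reference]}


# "Q0" is a placeholder, not a real dictionary item.
claim = gloss_quote_claim("sample gloss text", "Q0", "123")
print(claim["references"][0])
```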

Public domain monolingual dictionaries (preferred):

Public domain bilingual dictionaries (those in other regional languages preferred):

More may be found here.

Citable, but non-quotable, dictionaries

Other dictionaries that may be cited but not quoted include the following (those glossed in regional languages are likewise preferable):

Phrases

Grammars

Orthography

Regional Context

Tools

WD:Tools/Lexicographical data

Contact


  1. "In Hindi we have Shakespear and Forbes, but neither of these works is more than a very copious vocabulary, and both are derived almost exclusively from the written language."