Wikidata:Lexicographical data/Documentation/Languages/zu

Zulu
language, modern language
Subclass of: Nguni, Zunda
Native label: isiZulu
Country: Lesotho, Mozambique, South Africa
Indigenous to: Gauteng, Mpumalanga, KwaZulu-Natal
Has grammatical case: vocative case, locative case
Has tense: present tense, past tense, future tense
Writing system: Latin script
Language regulatory body: Pan South African Language Board
Ethnologue language status: 1 National
Wikimedia language code: zu

This page hosts information about the representation of isiZulu (the Zulu (Q10179) language) lexemes on Wikidata. Note that this page is subject to change as more information from various sources is accumulated and experience is gained on what is the most convenient, yet still computationally easily usable, way of representing isiZulu lexical information.

There is also general documentation on adding and updating lexicographic data in Wikidata, including a diagram of the lexeme data model of the key components of an entry and what they mean.

Besides filling this page with content, you also may like to browse the isiZulu Wikipedia or read up on the language in the entry in Wikipedia about the Zulu language, and browse the Wikidata entries that are already linked to isiZulu using the 'What links here' feature.

Lexical category

This list of categories is to be refined at a future date. See further below for guidelines to add more lexemes of the various categories to Wikidata.

Categories for words

Categories for word parts

Nouns

As with all Niger-Congo B languages, isiZulu has a system of noun classes in which each noun belongs to a noun class. There are several such classification systems. The one used most often is based on Meinhof, with extensions up to Canonici (i.e., also having classes 1a, 2a, 3a, and 9a). Alternatives that one may encounter are those of Doke and Grout. To disambiguate, the noun class annotation (on the form, not on the lexeme) should be selected appropriately, such as "noun class 1 (Meinhof)" or "noun class 1 (Doke)". At the time of writing, only Meinhof's classes up to noun class 17 had been added to Wikidata.

The general default is to use the noun in the singular for the lexeme, except when it exists only in the plural form (e.g., amanzi). Then add that noun again under "Form" and select the appropriate noun class as its grammatical feature. Since it is clear from the noun class whether the form is singular or plural, number need not be added separately. Then add the plural form (if applicable) and the noun class it belongs to. The singular/plural pairings of the noun classes for the updated Meinhof list of noun classes are: 1/2, 1a/2a, 3a/2a, 3/4, 5/6, 7/8, 9/10, 9a/6, 11/10, 14, 15, 17. This approach may not meet the full list of requirements for noun class information across NCB languages[1] in the best way, but seems to work for now.
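The singular/plural pairings listed above can be captured as a small lookup table. The following is a minimal sketch in Python; the names `PLURAL_CLASS` and `plural_class` are illustrative only and are not part of Wikidata or any library.

```python
from typing import Dict, Optional

# Singular -> plural noun class pairings for the updated Meinhof list,
# copied from the text above. Classes 14, 15 and 17 are listed without
# a paired plural class, which is represented here as None.
PLURAL_CLASS: Dict[str, Optional[str]] = {
    "1": "2", "1a": "2a", "3a": "2a", "3": "4", "5": "6", "7": "8",
    "9": "10", "9a": "6", "11": "10", "14": None, "15": None, "17": None,
}


def plural_class(singular: str) -> Optional[str]:
    """Return the paired plural class, or None if the class has no pair."""
    if singular not in PLURAL_CLASS:
        raise KeyError(f"not a singular or unpaired class in this list: {singular}")
    return PLURAL_CLASS[singular]
```

For instance, `plural_class("9a")` returns `"6"` and `plural_class("14")` returns `None`.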

Especially for new nouns, it may not be settled yet to which noun class they belong (if not the default 5 or 6), and older documentation will not have 9a but may still list 16. The system allows annotating a form with more than one noun class and omitting the noun class altogether. The downside of either of those two choices is that it will cause a bug in any computational use of said noun. Therefore, if the data is in disagreement, please add sources and a note of clarification or motivation; hopefully it will be agreed upon before it is needed.
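A simple check can flag noun forms that would cause such problems downstream. The sketch below assumes the forms have already been extracted into a plain dictionary from form string to its list of noun class annotations; the function name and the input shape are hypothetical, and `newnoun` and `othernoun` in the usage note are placeholders rather than real words.

```python
from typing import Dict, List


def noun_class_issues(forms: Dict[str, List[str]]) -> List[str]:
    """Report noun forms annotated with no noun class or with more than one."""
    issues = []
    for form, classes in forms.items():
        if len(classes) == 0:
            issues.append(f"{form}: no noun class")
        elif len(classes) > 1:
            issues.append(f"{form}: multiple noun classes ({', '.join(classes)})")
    return issues
```

Calling it on `{"umuntu": ["1"], "newnoun": ["5", "9a"], "othernoun": []}` reports the second and third entries and leaves `umuntu` alone. Note that this check is meant for nouns only; for concords, multiple noun class annotations on one form are expected.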

There are different types of nouns. Since most nouns are count nouns, one may adopt the default of stating just "noun" and annotate only other types of nouns explicitly; notably, mass nouns (things that can be counted only in quantities, like water, gold, wood) and collective nouns (collectives of things, such as an electorate or a herd). For examples, see amanzi (L8426) for a mass noun entry, umuntu (L37485) for a count noun entry, and umphakathi (L700292) for a collective noun entry.

It is recommended to add a sense in some other language of choice to help other people understand what the lexeme means. This is added under "sense". One can also link it to a Q-item in Wikidata with the property item for this sense (P5137).

One optionally may like to add as "Statement" the stem of the noun, using the property word stem (P5187). The elsewhere customary preceding dash should be omitted.
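If stems are copied from sources that write them with the customary leading dash, they can be cleaned before being entered. This is a minimal sketch; `normalise_stem` is a hypothetical helper name, not an existing tool.

```python
def normalise_stem(stem: str) -> str:
    """Drop surrounding whitespace and any leading hyphen or en dash from a stem."""
    return stem.strip().lstrip("-\u2013").strip()
```

For example, `normalise_stem("-ntu")` returns `"ntu"`, the stem of umuntu, and a stem that is already clean is returned unchanged.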

The noun prefixes are productive and added separately; see the section on prefixes below for details.

Word parts: affixes, clitics, and concords

Word parts are worth adding to Wikidata to the extent that they are useful for generating whole words, i.e., are productive in some way. Examples are noun prefixes, which are also used for numbers (e.g., the 'ama' in engama-25); the clitics/concords, such as the subject concord for verb conjugation (e.g., u- to complete the stem -dla); and wh-questions (e.g., -phi). The CARP extensions in verbs, while strictly speaking affixes to the verb root, are not 'interesting' in that regard and would thus not be added separately to Wikidata.

Affixes

The noun prefixes have been added in the same way as described for the concords below; see umu (L689510).

Other productive affixes, such as the wh-questions and locatives, are yet to be added at the time of writing.

Concords

The clitics are better known as concords. There are numerous concords, such as relative, possessive, and adjectival concord. Since they are productive and needed for natural language generation to get Abstract Wikipedia working, they have to be added to Wikidata. Since the lexeme is supposed to be a lexeme and not a name of a concord, a slight workaround is used, as follows, and illustrated in engi (L688517) for the relative concord:

  • 1) pick the first entry from the list, which will be first person singular or noun class 1, and use that string for lexeme name;
  • 2) add all concords for each noun class as a form each;
  • 3) annotate each form with the applicable noun class as 'grammatical feature'.

There may indeed be multiple noun class annotations for a single form; that is fine. It may also be the case that a concord is the empty string; this is still important to know, so in that case add the empty-set symbol (such as ø) as a form and annotate it with the noun class.
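The three steps, together with the handling of shared and empty strings, can be sketched as follows. This is an illustration under stated assumptions: the function name and the returned dictionary shape are hypothetical and do not reflect the actual Wikidata lexeme JSON, and the concord strings in the usage note should be verified against a grammar before use.

```python
from typing import Dict, List, Tuple

EMPTY = "ø"  # stand-in for an empty concord, as suggested in the text above


def concord_entry(concords: List[Tuple[str, str]]) -> Dict[str, object]:
    """Build a lexeme-like record from (person or noun class, concord) pairs.

    Step 1: the first string in the list becomes the lemma.
    Step 2: each distinct concord string becomes one form.
    Step 3: each form is annotated with every person or class that uses it.
    """
    if not concords:
        raise ValueError("need at least one concord")
    forms: Dict[str, List[str]] = {}
    for label, text in concords:
        # An empty concord is still recorded, using the empty-set symbol.
        forms.setdefault(text or EMPTY, []).append(label)
    lemma = concords[0][1] or EMPTY
    return {"lemma": lemma, "forms": forms}
```

Given `[("first person singular", "engi"), ("noun class 1", "o"), ("noun class 3", "o")]`, the lemma is `"engi"` and the single form `"o"` carries both noun class annotations.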

The lexical category of clitic (Q213458) may be refined to proclitic (Q108819306) or enclitic (Q6548647).

Verbs

The verbs have been added by stem so far (not root nor infinitive) and without the dash; e.g., shaya (L677334) and shayela (L677332). Thus for 'eat' one would add 'dla', not -dla, -dl-, or ukudla. Also here it helps to add a sense to it, especially if there are multiple senses for the verb.

It is possible to specialise the type of verb, such as transitive verb (Q1774805) and intransitive verb (Q1166153). This is a statement on the lexeme level (in analogy with the type of noun).

Regarding forms, since isiZulu is highly inflectional, it is deemed not feasible to add all the forms of a verb, but they will be computed on the fly when needed. One may repeat the stem there as form. It is also possible to add particular forms, such as the imperative (then add grammatical feature 'imperative') or the stem under negation (final vowel = i).

One optionally may like to add as "Statement" the verb stem, using the property word stem (P5187) (the preceding dash should be omitted). Note that the verb root, the verb rad, and the verb stem are different things, but currently there are not enough properties in Wikidata for that. Graphically in this figure and textually:

  • verb stem = verb rad + final vowel
  • verb rad = verb root + CARP extension (i.e., any of is, el, an, or w)
  • verb root = the basic verb without the CARP extension

Examples:

  • bonana (to see each other) is a verb stem, where bonan is the verb rad, a is the final vowel, bon is the verb root, and an is the reciprocative in the extension.
  • bonisana (to show each other) is a verb stem, where bonisan is the verb rad, a is the final vowel, bon is the verb root, and the extension is made up of both is (causative) and an (reciprocative).
  • bona (to see) is a verb stem, where bon is the verb rad and a is the final vowel, and since the extension is empty, bon is also the verb root.
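The relationship between root, rad, and stem can be sketched as plain string composition. This is a simplified illustration: the function names are hypothetical, and the sketch ignores any sound changes that an extension may trigger in the root.

```python
from typing import Sequence

CARP_EXTENSIONS = ("is", "el", "an", "w")  # as listed in the text above


def verb_rad(root: str, extensions: Sequence[str] = ()) -> str:
    """verb rad = verb root + CARP extension(s), in the order given."""
    for ext in extensions:
        if ext not in CARP_EXTENSIONS:
            raise ValueError(f"unknown extension: {ext}")
    return root + "".join(extensions)


def verb_stem(root: str, extensions: Sequence[str] = (), final_vowel: str = "a") -> str:
    """verb stem = verb rad + final vowel."""
    return verb_rad(root, extensions) + final_vowel
```

With the examples above, `verb_stem("bon")` gives `"bona"`, `verb_stem("bon", ["an"])` gives `"bonana"`, and `verb_stem("bon", ["is", "an"])` gives `"bonisana"`. Passing `final_vowel="i"` produces the stem under negation, e.g. `"boni"`.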

Since we already record by verb stem, word stem (P5187) either can be ignored or repurposed for the root if the extension is not empty.

If you want to add how a verb is composed with the combines lexemes (P5238) property, then use the final vowel lexeme forms a (L740041) and, if applicable, any of the CARP extensions: is (L732948), el (L732943), an (L732945), w (L732952).
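As a sketch, the lexeme IDs named above can be collected for a given list of extensions. The function name is hypothetical, and the ordering of the result (extensions first, then the final vowel) is an assumption made for illustration; the sketch only assembles IDs and does not write anything to Wikidata.

```python
from typing import Dict, List, Sequence

FINAL_VOWEL_A = "L740041"  # final vowel 'a', as given in the text above

# CARP extension lexemes, as given in the text above.
EXTENSION_LEXEMES: Dict[str, str] = {
    "is": "L732948",
    "el": "L732943",
    "an": "L732945",
    "w": "L732952",
}


def component_lexemes(extensions: Sequence[str]) -> List[str]:
    """Return the lexeme IDs for the extensions, followed by the final vowel 'a'."""
    return [EXTENSION_LEXEMES[ext] for ext in extensions] + [FINAL_VOWEL_A]
```

For bonisana, `component_lexemes(["is", "an"])` returns `["L732948", "L732945", "L740041"]`.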

Adjectives

In short: record the stem (without the dash at the start), since the full word is formed depending on the noun class of the noun that the adjective modifies. Example: see de (L705896).

More precisely, one may consider there to be two types of adjectives: 1) a small (closed) set of true adjectives (true adjective (Q65453883)) and 2) 'adjectives' that are relatives. The latter used to default to relativity (Q983751) as the preferred string in the editing interface, but meanwhile a new item has been introduced for these 'adjectives' that are relatives: relative adjective (Q115388551). If you are not sure and the 'adjective' has not already been added as a true adjective in Wikidata (the 14 listed on Wikipedia had been added as of 29-11-2022), select relative adjective (Q115388551) as the lexical category.

A key reason to choose explicitly either true adjective (Q65453883) or relative adjective (Q115388551) when adding a new lexeme is that, in natural language generation, the former takes the adjectival concord and the latter takes the relative concord.
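In a generation pipeline, this choice amounts to a lookup on the lexical category. The sketch below is illustrative only; the constant and function names are hypothetical, while the Q-identifiers are the ones named in this section.

```python
TRUE_ADJECTIVE = "Q65453883"
RELATIVE_ADJECTIVE = "Q115388551"

# Which kind of concord an 'adjective' takes, by lexical category.
CONCORD_FOR_CATEGORY = {
    TRUE_ADJECTIVE: "adjectival concord",
    RELATIVE_ADJECTIVE: "relative concord",
}


def concord_type(lexical_category: str) -> str:
    """Return the kind of concord to use for the given lexical category."""
    try:
        return CONCORD_FOR_CATEGORY[lexical_category]
    except KeyError:
        raise ValueError(f"not an adjective category: {lexical_category}") from None
```

A lexeme with any other lexical category raises an error here, which makes a missing or imprecise category visible early instead of silently producing the wrong concord.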

Adverbs

Record the word. Example: see ngamandla (L705899). What else may need to be recorded is still to be considered in more detail.

Statements

Statements are at the lexeme level (see, e.g., unyaka (L686326)). It is not immediately obvious from the interface which statements are permitted. Here is a selection:


Forms

The forms are currently used mainly for annotating the lexeme with, at least, the noun class and as a way to add the list of prefixes and concords. Among the options available, at least the following ones may be of use:

See also the suggestions on recording forms in the lexical category sections above, notably those for verbs, nouns, and adjectives.

Senses

At least one sense should be added for each lexeme and, ideally, linked to at least one Q-item.

Sometimes this is non-trivial or there are no 1:1 mappings to items or other lexemes in Wikidata. One then can add a description in the sense field (see, e.g., ihawozi (L688559)) and leave it at that, or add more senses if sources do not agree. It may also happen that the informal sense description is not exactly the same but sufficiently synonymous with the label of the item (e.g., umfazi (L677330) may be translated as married woman or wife).

The main properties that can be used on a sense are:


Resources

Note: these lists are incomplete and will be extended over time.

Lexicographic data

  • isiZulu Wiktionary
  • Free online Zulu-English dictionary isiZulu.net
  • Selection of isiZulu concords in Wiktionary and more concords
  • Dent, G.R. and C.L.S. Nyembezi. Scholar's Zulu Dictionary. Pietermaritzburg: Shuter & Shooter. Revised version of 2010
  • De Schryver, G.-M., et al. Oxford Bilingual School Dictionary: Zulu and English / Isichazamazwi Sesikole: isiZulu - isiNgisi. New (2010) or revised (2015).

Grammar resources

Organisations

Other


References

  1. Keet, C. Maria; Khumalo, Langa; Mahlaza, Zola (April 25, 2022). Considerations for a model for NCB noun classes in Wikidata (PDF). WikiWorkshop 2022. online.