Wikidata:Lexicographical data/Universal Dependencies

This page documents how Universal Dependencies relationships may be mapped onto Wikidata statements on lexemes.

(If you know what Wikidata lexemes are but have no idea what Universal Dependencies relationships are, read the motivation below and then read this walkthrough.)

Motivation

Some lexemes are composed of multiple parts; they include certain types of affixed words, compound words, and both fixed and variable phrases. Each part plays a role in the internal organization of the whole, and some structure, however simple or complex, usually governs how the parts fit together. One way to model this structure is with dependency grammar relationships, in which each part is a dependent of some other head part and the relationships eventually converge at a root part. The result is a tree structure that resembles syntactic constituency trees in some ways (for many languages, individual phrases and subphrases can still be recognized as contiguous subtrees) but differs markedly in others (phrases themselves are not an explicit level in the hierarchy).

It is hoped that marking up syntactic dependency relationships in Wikidata lexemes can improve those lexemes' usability in a natural language generation system, so that the lexemes can be modified more effectively when other context within the syntax tree of a given sentence (different subjects, objects, modifiers, and so on) demands it.

Syntactic dependency relationships may be indicated with the following two properties qualifying combines lexemes (P5238) statements on a lexeme:

  • syntactic dependency head position (P9764): position (as a value for series ordinal [P1545] on another combines lexemes [P5238] statement of the same lexeme) of the head to which the "syntactic dependency head relationship" value qualifying this combines lexemes [P5238] statement points
  • syntactic dependency head relationship (P9763): type of dependency relationship between the part being qualified and its corresponding head (denoted with "syntactic dependency head position")
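
As a rough illustration of how these two qualifiers describe a tree, the following Python sketch models each combines lexemes (P5238) statement as a small record and groups dependents under their heads. The lexeme identifiers ("L-ice" and so on) are placeholders rather than real lexeme IDs, the relationship values are plain Universal Dependencies labels rather than the Wikidata items a real statement would use, and treating a part with no head position as the root is an assumption of this sketch, not something this page prescribes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Part:
    """One 'combines lexemes' (P5238) statement with its qualifiers."""
    lexeme: str                   # target lexeme (placeholder ID, not a real one)
    ordinal: int                  # series ordinal (P1545)
    head_position: Optional[int]  # syntactic dependency head position (P9764)
    relationship: str             # syntactic dependency head relationship (P9763)


def build_tree(parts):
    """Return (root_ordinal, children), where children maps each part's
    ordinal to the ordinals of its direct dependents, in ascending order."""
    by_ordinal = {p.ordinal: p for p in parts}
    children = {p.ordinal: [] for p in parts}
    roots = []
    for p in sorted(parts, key=lambda part: part.ordinal):
        if p.head_position is None:
            # Assumption: a part without a head position is the root.
            roots.append(p.ordinal)
        elif p.head_position in by_ordinal:
            children[p.head_position].append(p.ordinal)
        else:
            raise ValueError(
                f"part {p.ordinal} points at missing head {p.head_position}"
            )
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root, found {len(roots)}")
    return roots[0], children


# "ice cream cone": 'cone' is the root, 'cream' depends on 'cone',
# and 'ice' depends on 'cream'.
parts = [
    Part("L-ice", 1, 2, "compound"),
    Part("L-cream", 2, 3, "compound"),
    Part("L-cone", 3, None, "root"),
]
root, children = build_tree(parts)
```

The sketch only checks that every head position refers to an existing series ordinal and that exactly one root exists; it does not detect cycles among the remaining parts.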

Relationship to CoNLL-U

Existing Universal Dependencies syntax treebanks present their sentences in CoNLL-U, a revised version of the format described in CoNLL-X Shared Task on Multilingual Dependency Parsing (Q109242303). These sentences may for the most part be mapped to combines lexemes (P5238) statements on Wikidata lexemes. Some of the restrictions that CoNLL-U places on certain data fields (that is, those whose description below involves choosing from a 'universal' set) have been discarded on Wikidata, so that the set of values allowed in those fields may be expanded as needed to better model a particular language.

The fields of a CoNLL-U line are mapped to parts of a combines lexemes (P5238) statement as follows:

  1. ID: Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes (decimal numbers can be lower than 1 but must be greater than 0).
  2. FORM: Word form or punctuation symbol.
    • Use object form (P5548) as a qualifier to store this field. Note that punctuation symbols are not accepted as lexemes, so that any nodes for them should be omitted.
  3. LEMMA: Lemma or stem of word form.
    • Not necessary to explicitly mark as this should be inferred from the lemma(ta) on the target of the statement.
  4. UPOS: Universal part-of-speech tag.
    • Not necessary to explicitly mark as this should be inferred from the lexical category on the target of the statement. While there is no fixed set of part-of-speech items to use on lexemes, it is hoped that the set used across all languages remains generally small.
  5. XPOS: Optional language-specific (or treebank-specific) part-of-speech / morphological tag.
  6. FEATS: List of morphological features from the universal feature inventory or from a defined language-specific extension.
  7. HEAD: Head of the current word, which is either a value of ID or zero (0).
  8. DEPREL: Universal dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.
  9. DEPS: Enhanced dependency graph in the form of a list of head-deprel pairs.
    • (Not possible to add here, since there is no integer-item tuple type for property values on Wikidata that would allow such head-deprel pairs to be kept together in the data[1].)
  10. MISC: Any other annotation.
    • (Use other qualifiers on the statement not noted above.)
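
The field list above can be sketched as a small Python function that splits one CoNLL-U token line into its ten tab-separated fields and collects the values that would end up as qualifiers. Only the FORM mapping to object form (P5548) is stated in the list itself; storing ID as series ordinal (P1545), HEAD as P9764, and DEPREL as P9763 is inferred from the property descriptions earlier on this page and should be read as an assumption. The function name, the dictionary layout, and the decision to skip multiword-token ranges and empty nodes are likewise hypothetical.

```python
FIELDS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
          "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def conllu_line_to_qualifiers(line):
    """Map one CoNLL-U token line to a dict of qualifier values,
    or return None when the node should be omitted."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(
            f"expected {len(FIELDS)} tab-separated fields, got {len(values)}"
        )
    row = dict(zip(FIELDS, values))

    # Punctuation symbols are not accepted as lexemes, so their nodes are omitted.
    if row["UPOS"] == "PUNCT":
        return None
    # Assumption: multiword-token ranges ("1-2") and empty nodes ("1.1") are skipped.
    if not row["ID"].isdigit():
        return None

    qualifiers = {
        "P1545": row["ID"],      # series ordinal (assumed home for ID)
        "P5548": row["FORM"],    # object form; kept here as the surface string only
        "P9763": row["DEPREL"],  # raw label; a real statement needs the mapped item
    }
    # HEAD = 0 marks the root; assumption: no head position is recorded for it.
    if row["HEAD"] != "0":
        qualifiers["P9764"] = row["HEAD"]
    # LEMMA and UPOS are not stored: they are inferred from the target lexeme.
    return qualifiers


token = conllu_line_to_qualifiers("1\tice\tice\tNOUN\t_\t_\t2\tcompound\t_\t_")
punct = conllu_line_to_qualifiers("4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_")
```

Because punctuation nodes are dropped without renumbering, the remaining series ordinals may contain gaps; whether to renumber them is left open in this sketch.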

Mappings of relationships

The list of mappings of dependency relations to Wikidata items used with syntactic dependency head relationship (P9763) has been moved to Wikidata:Lexicographical data/Universal Dependencies/Mappings.

  1. If one were inclined to abuse Wikibase types, the 'quantity' datatype (a numeric amount plus an optional unit given as a Wikidata item) could be pressed into service for this purpose, and likewise to handle the HEAD and DEPREL fields.