Wikidata:Lexical Masks

Lexical Masks: Introduction

General Principle

Lexical masks aim to represent, in a consistent way, the expected internal structure of lexical entries. Masks are defined for each language and each part-of-speech in that language.

Lexical masks are specifications of the requirements a lexical entry should fulfill. In particular, a mask defines:

how many forms the entry should have to be complete;
what features are expected for each form.

Masks are specific to a part-of-speech and a language, and one part-of-speech of one language can have more than one mask. For example, the table below shows the specification for Italian adjective entries. It specifies that four forms are expected and that each form should have a unique combination of gender and number features (i.e. there is one form for each feature bundle: MascGender / SingNumber, MascGender / PlurNumber, FemGender / SingNumber, and FemGender / PlurNumber).

	Masculine Gender	Feminine Gender
Singular Number	form1	form2
Plural Number	form3	form4

Of course, lexical entries can be (and often are) much more complex: in the number of forms, in how the forms combine the available dimensions, and in the features used to describe the forms and the entry as a whole.
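As an illustration, the Italian adjective mask above can be sketched as plain data. This is a minimal sketch, not the format used in the published JSON files: the function name `expected_bundles` is hypothetical, and the feature names are copied from the table.

```python
from itertools import product

# Feature names taken from the Italian adjective table above.
GENDERS = ["MascGender", "FemGender"]
NUMBERS = ["SingNumber", "PlurNumber"]


def expected_bundles(*dimensions):
    """Return every feature bundle a mask expects, one bundle per form.

    Hypothetical helper: each bundle is the combination of one value
    from every dimension passed in.
    """
    return [frozenset(combo) for combo in product(*dimensions)]


italian_adjective_mask = expected_bundles(GENDERS, NUMBERS)

# One line per expected form, e.g. "MascGender / SingNumber".
for bundle in italian_adjective_mask:
    print(" / ".join(sorted(bundle)))
```

Adding a further dimension (for example a hypothetical degree feature) only means passing one more list, which is why masks for richer paradigms grow quickly in size.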

Distinguishing entry-level and form-level features

Lexical entries are characterized not only by their forms and the features associated with those forms, but also by features assigned at the entry level, which are inherent to the entire entry. For example, the mask for Russian nouns below has an entry-level specification that requires a combination of animacy and gender features, and a set of form-level features specifying that each form must have a case and a number feature.

Entry-level	MascGender+Inanimate OR FemGender+Animate OR NeutGender+Animate OR ...

Form-Level	Number=Sing	Number=Pau	Number=Plur
Case=Nom	form1	form10	form19
Case=Gen	form2	form11	form20
Case=Dat	form3	form12	form21
Case=Acc	form4	form13	form22
Case=Inst	form5	form14	form23
Case=Prep	form6	form15	form24
Case=Part	form7	form16	form25
Case=Loc	form8	form17	form26
Case=Voc	form9	form18	form27

Examples of entry-level features include gender and animacy for nouns, aspect and transitivity for verbs, and degree for adjectives.
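The split between the two levels can be sketched as a small data structure. This is an illustrative sketch under stated assumptions: the dictionary layout and the helper names `form_slots` and `entry_level_ok` are hypothetical, and only the entry-level alternatives spelled out in the table are included (the source list continues with "...").

```python
from itertools import product

# Hypothetical encoding of the Russian noun mask shown above.
russian_noun_mask = {
    # Only the alternatives written out in the table; the source lists more.
    "entry_level": [
        {"MascGender", "Inanimate"},
        {"FemGender", "Animate"},
        {"NeutGender", "Animate"},
    ],
    "form_level": {
        "Number": ["Sing", "Pau", "Plur"],
        "Case": ["Nom", "Gen", "Dat", "Acc", "Inst", "Prep", "Part", "Loc", "Voc"],
    },
}


def form_slots(mask):
    """Enumerate the feature bundles of the form-level grid, one per form."""
    dims = mask["form_level"]
    names = list(dims)
    return [dict(zip(names, values)) for values in product(*(dims[n] for n in names))]


def entry_level_ok(mask, entry_features):
    """True if the entry carries one of the allowed entry-level combinations."""
    features = set(entry_features)
    return any(alternative <= features for alternative in mask["entry_level"])


slots = form_slots(russian_noun_mask)
print(len(slots))  # 3 numbers x 9 cases = 27 expected forms
print(entry_level_ok(russian_noun_mask, ["FemGender", "Animate"]))
```

The enumeration order follows the table: the first nine slots are the singular forms, the next nine the paucal forms, and the last nine the plural forms.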

Accounting for more granularity: multiple masks

The configuration of lexical entries must also be flexible enough to account for entries with different structures. For example, there are two masks for German nouns. The first mask, shown in the first table below, covers nouns that have an intrinsic gender (i.e. at the entry level) together with all the case and number declensions of that noun. The second mask, given in the second table, describes nouns that have no inherent gender at the entry level but have specific inflections per gender (e.g. nouns for professions).

Mask 1

Entry-level	MascGender OR FemGender OR NeutGender

Form-level	Number=Sing	Number=Plur
Case=Nom	form1	form5
Case=Gen	form2	form6
Case=Dat	form3	form7
Case=Acc	form4	form8

Mask 2

	MascGender+SingNumber	MascGender+PlurNumber	FemGender+SingNumber	FemGender+PlurNumber
Case=Nom	form1	form5	form9	form13
Case=Gen	form2	form6	form10	form14
Case=Dat	form3	form7	form11	form15
Case=Acc	form4	form8	form12	form16
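One way to read the two German noun masks is that an entry is acceptable if it satisfies at least one of them. The sketch below illustrates that reading under assumptions: the `MASKS` layout, the `entry_gender` key, and the function `matching_masks` are hypothetical, and the example entries are described only by their feature bundles rather than by real word forms.

```python
from itertools import product

CASES = ["Nom", "Gen", "Dat", "Acc"]
NUMBERS = ["Sing", "Plur"]

# Hypothetical encoding: "entry_gender" lists the allowed entry-level values
# (None means no gender at the entry level); "forms" is the set of bundles.
MASKS = {
    "mask_1": {
        "entry_gender": {"MascGender", "FemGender", "NeutGender"},
        "forms": {frozenset(p) for p in product(CASES, NUMBERS)},
    },
    "mask_2": {
        "entry_gender": {None},
        "forms": {
            frozenset(p)
            for p in product(CASES, ["MascGender", "FemGender"], NUMBERS)
        },
    },
}


def matching_masks(entry, masks=MASKS):
    """Return the names of all masks the entry satisfies."""
    found = []
    forms = [frozenset(f) for f in entry["forms"]]
    for name, mask in masks.items():
        gender_ok = entry.get("gender") in mask["entry_gender"]
        # Same count and same set of bundles: no gaps and no duplicates.
        forms_ok = len(forms) == len(mask["forms"]) and set(forms) == mask["forms"]
        if gender_ok and forms_ok:
            found.append(name)
    return found


# Hypothetical entries, given as feature bundles only.
gendered_noun = {
    "gender": "NeutGender",
    "forms": [set(p) for p in product(CASES, NUMBERS)],
}
profession_noun = {
    "gender": None,
    "forms": [set(p) for p in product(CASES, ["MascGender", "FemGender"], NUMBERS)],
}

print(matching_masks(gendered_noun))    # ['mask_1']
print(matching_masks(profession_noun))  # ['mask_2']
```

An entry that matches no mask is returned with an empty list, which is the case that needs a closer look.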

Using masks for lexicon validation

The mask model presented here is used to perform a semi-automatic evaluation of the lexicon. Each lexicon entry of a particular language (in this example, an Italian adjectival entry) is checked against the mask. During this process, we check that (1) the adjectival entry has exactly four forms, and (2) each form has one of the required unique combinations of gender and number features (e.g. we cannot have two forms that are both plural and feminine).

This evaluation process marks all entries that pass the masks as structurally valid. Entries that do not pass the masks have to be looked at more carefully.
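The two checks described above can be sketched as follows. This is a simplified illustration, not the validation code used for Wikidata: the function `validate` and the message wording are hypothetical, and forms are represented as (spelling, features) pairs.

```python
from collections import Counter

# The four feature bundles required by the Italian adjective mask.
REQUIRED = {
    frozenset({"MascGender", "SingNumber"}),
    frozenset({"MascGender", "PlurNumber"}),
    frozenset({"FemGender", "SingNumber"}),
    frozenset({"FemGender", "PlurNumber"}),
}


def validate(forms, required=REQUIRED):
    """Return a list of problems; an empty list means structurally valid."""
    problems = []
    # Check (1): the entry has the expected number of forms.
    if len(forms) != len(required):
        problems.append(f"expected {len(required)} forms, found {len(forms)}")
    # Check (2): every form has a required bundle, and no bundle repeats.
    counts = Counter(frozenset(features) for _, features in forms)
    for bundle, n in counts.items():
        label = "+".join(sorted(bundle))
        if bundle not in required:
            problems.append(f"unexpected feature bundle: {label}")
        elif n > 1:
            problems.append(f"duplicate feature bundle: {label}")
    for bundle in sorted(required - set(counts), key=sorted):
        problems.append("missing feature bundle: " + "+".join(sorted(bundle)))
    return problems


good = [
    ("alto", {"MascGender", "SingNumber"}),
    ("alti", {"MascGender", "PlurNumber"}),
    ("alta", {"FemGender", "SingNumber"}),
    ("alte", {"FemGender", "PlurNumber"}),
]
# Same entry, but "alta" is wrongly tagged as feminine plural.
bad = good[:2] + [("alta", {"FemGender", "PlurNumber"}), good[3]]

print(validate(good))  # []
print(validate(bad))
```

Returning a list of problems rather than a single pass/fail flag makes it easier to decide, case by case, what to do with an entry that does not pass.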

How does it work in practice?

The masks are formalized as JSON files (published on GitHub). From these JSON files, other artefacts can be created for use in practice, such as ShEx files (published in Wikidata). A script available in that GitHub repository translates the JSON files to EntitySchemas.

In the ShEx file for German nouns, line 12 gives the SPARQL query that finds all lexicographic entries the ShEx file applies to (all possible focus nodes for the shape described by the ShEx file). Below that comes the description of the lexical entry: line 22 requires the grammatical gender to be given at the entry level, and lines 23ff. define the eight individual forms that constitute a German noun as per the table above.

The validation ensures that all required forms are present, that the right combination of grammatical features is given throughout the forms, and that all entry-level values are set as required. Furthermore, as usual with RDF, the validation does not prevent the data from having additional annotations and markers, e.g. it does not interfere with semantic annotations on the lexical entries, or with links between entries from different languages. The ShEx files exclusively check the completeness of the morpho-syntactic forms of the lexical entry.

To run the validation, go to a specific ShEx file (see list below) and click on the link to the right of the ShEx ("check entities against this Schema"). This opens a new window with the ShEx validator script and a field for running a command. In this field, use the command shown on line 22 of the script, with a limit added (to keep the script from running too long).

For example, for the German noun ShEx explained above, you can use the following command:

SELECT ?focus { ?focus dct:language wd:Q188 ; wikibase:lexicalCategory wd:Q1084 } LIMIT 10

Then click on the button on the right ("validate"), and you should see a list of Lexeme entries, with information about whether each one passes the ShEx or not.
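The same kind of limited focus query can be generated for other language and category pairs. The sketch below is an assumption-laden helper, not part of the published tooling: the function names `focus_query` and `request_url` are hypothetical, and previewing the query on the public Wikidata Query Service endpoint is a suggestion rather than something the validator requires. The script only builds the strings and sends nothing over the network.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def focus_query(language_qid, category_qid, limit=10):
    """Build a focus-node query for one language and lexical category."""
    return (
        "SELECT ?focus { "
        f"?focus dct:language wd:{language_qid} ; "
        f"wikibase:lexicalCategory wd:{category_qid} "
        f"}} LIMIT {int(limit)}"
    )


def request_url(query):
    """Build (but do not send) a GET URL that asks for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


# Q188 (German) and Q1084 (noun) are the identifiers used in the query above.
query = focus_query("Q188", "Q1084")
print(query)
print(request_url(query))
```

Keeping the limit as a parameter makes it easy to start with a handful of entries and raise it once the schema behaves as expected.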

List of existing Masks

German
English
French
Italian
Russian

See on GitHub.

List of existing ShEx files

Basque
Bengali
- Bengali adjective
- Bengali adverb
Breton
- Breton lexeme
- Breton verb
Danish
English
French
- French standard noun
German
Russian
- Russian standard noun singular/plural
- Russian standard noun with paucal

There are more ShEx files in the GitHub repo. Feel free to copy them here.

Using Mask for Lexical Editing Form

Wikidata is developing its platform and infrastructure to support ShEx files in a wide range of use cases. Most importantly for us, we can use the files we publish to validate existing lexicographic entries. This allows for large-scale semi-automatic validation of the crowdsourced entries in Wikidata, and thus provides a feedback loop for the community to see the quality of their entries. Contributors can get a generated list of all entries that do not fulfill the constraints described in the ShEx files, and then decide case by case how to handle the data (i.e. whether it is a valid exception, whether it requires an alternative or silver mask, or whether the entry needs to be improved).

We expect that the JSON files can be translated into form definitions more easily.

Languages available so far:

Basque
English
French
German
Hebrew
Italian
Russian

Paper

The paper about lexical masks was published at LREC 2020: "Introducing Lexical Masks: a New Representation of Lexical Entries for Better Evaluation and Exchange of Lexicons", by Bruno Cartoni, Daniel Calvelo Aros, Denny Vrandečić, and Saran Lertpradit.

Frequently Asked Questions

What if I don’t agree with how a mask is set up for a particular POS/language?

It is possible that the available masks are not accurate. The most important question is whether the proposed structure is unsuitable only for a particular set of words, or unsuitable altogether. In the former case (the mask does not fit a specific set of words), it is better to create another mask (a “silver” mask), which will probably be a subset of the specification in the existing mask. In the latter case, you can change the existing mask. But be careful: the change will apply to all entries.

There is no mask for my language, how can I set one up?

We are happy to help! Please contact us on the talk page. What we often lack is the knowledge about a given language - but we have the expertise on how to create the specifications. We would love to cooperate, and specify the masks based on your knowledge.

Our suggestion is that you create two or three full entries for a given language and part-of-speech, exemplifying what a good entry would look like - and we will then try to capture that. Any additional explanations are welcome, and it would be great if you could be available for clarifying questions. If you have another idea on how to cooperate, please let us know.