Wikidata talk:Lexicographical data/Archive/2021/01

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

Is there some tool to import lexemes from Wiktionaries into Wikidata?

For example - as far as I am aware - https://www.wikidata.org/wiki/Q19688659 has no declension info that is present on https://en.wiktionary.org/wiki/J%C3%B3zef

Is it OK to copy it into Wikidata ("Text is available under the Creative Commons Attribution-ShareAlike License" while Wikidata is CC0)? If copying from Wiktionary into Wikidata is OK, is there an existing tool for that? Mateusz Konieczny (talk) 05:49, 24 November 2020 (UTC)

ShareAlike means to keep the same licensing, so you can't copy the content of Wiktionary in batch. Some people consider that some data are not covered by the licensing and copy the information of the existence of lexemes and some grammatical facts. I am not aware of existing tools. Noé (talk) 15:48, 24 November 2020 (UTC)

@Yurik: did the import from Russian with LexData. I don't know if he published the scripts somewhere. Legally, the words and forms can be copied, if I correctly understood the lawyerspeak I found on Meta on the matter, and there has been no controversy since the Russian import. Wiktionaries are under US law, which has no database protection like in the EU/Sweden, and forms of words are not copyrightable under US law (they are not considered original work, or something like that).--So9q (talk) 19:04, 8 January 2021 (UTC)

Reverse Etymology, Outdated Senses, Deprecated Synonyms

In the Lemon model SKOS example, the word consumption is marked as outdated compared to tuberculosis, as shown in the example snippet below. (Indeed, we already have `consumption` as an alias on tuberculosis (Q12204) to help find `tuberculosis`, just in case a time traveler comes looking.)

:consumption_sense a ontolex:LexicalSense;
        ontolex:isLexicalizedSenseOf :tuberculosis;
        ontolex:usage [ rdf:value "outdated" ].

How can we handle this or plan to handle this in the Lexeme namespace on:

tuberculosis (L44789)

Would replaced synonym (for nom. nov.) (P694) be a part of the solution if constraints loosened on it? --Thadguidry (talk) 23:54, 8 January 2021 (UTC)

I would add a sense on the noun consumption and link it to the QID for tuberculosis. Then I would add a property to that sense indicating that it is archaic, if you have a source saying that. There is a language-use property for that. Satisfied? --So9q (talk) 07:32, 9 January 2021 (UTC)

On tuberculosis (L44789), I think you should try to link to consumption (L6153) for "consumption", not to the alias of tuberculosis (Q12204).

It is unlikely that taxonomists would appreciate your repurposing of P694. --- Jura 08:15, 9 January 2021 (UTC)

Family names

Hmm, how would one best capture this fact taken from Wikipedia?

Keegan is an Anglicisation of the Irish clan name Mac Aodhagáin

with a Lexeme? Item? Both in concert? --Thadguidry (talk) 20:48, 7 January 2021 (UTC)

@Thadguidry: good question.

We definitely need at least one item in the current model (as family name (P734) needs an item), and obviously it already exists: Keegan (Q21492676). This item could be linked to Mac Aodhagáin (Q6722188) (not sure how; with named after (P138), maybe? BTW, this item needs some checking).

We may also have Lexemes, linked together by derived from lexeme (P5191) with mode of derivation (P5886) = anglicization (Q540885) as qualifier, and with senses linked to the corresponding items (with item for this sense (P5137)). As a bonus, you could also create the etymology of "Mac Aodhagáin" (composed of Mac + Aodhagáin, with Aodhagáin itself coming from Aodh, and so on, ideally all with sources).

Cheers, VIGNERON (talk) 12:18, 8 January 2021 (UTC)
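
The modelling suggested above could be sketched roughly as follows. This is only an illustration in plain Python, not the Wikibase JSON format or any real API: the `Statement` class is a simplified stand-in, and the two lexeme IDs are hypothetical placeholders, since no lexeme IDs are given in this discussion. The property and item IDs are the ones named above.

```python
from dataclasses import dataclass, field

# Property and item IDs named in the discussion above.
DERIVED_FROM_LEXEME = "P5191"
MODE_OF_DERIVATION = "P5886"
ITEM_FOR_THIS_SENSE = "P5137"
ANGLICIZATION = "Q540885"
KEEGAN_ITEM = "Q21492676"

# Hypothetical placeholders: no lexeme IDs are given in the discussion.
KEEGAN_LEXEME = "L-keegan"
MAC_AODHAGAIN_LEXEME = "L-mac-aodhagain"


@dataclass
class Statement:
    """A simplified stand-in for a Wikibase statement (not the real JSON)."""
    subject: str
    prop: str
    value: str
    qualifiers: dict = field(default_factory=dict)


def anglicisation_statements():
    """Return the two statements suggested for the Keegan lexeme."""
    # Lexeme-to-lexeme derivation, qualified with the mode of derivation.
    derivation = Statement(
        subject=KEEGAN_LEXEME,
        prop=DERIVED_FROM_LEXEME,
        value=MAC_AODHAGAIN_LEXEME,
        qualifiers={MODE_OF_DERIVATION: ANGLICIZATION},
    )
    # The sense (hypothetical ID suffix "-S1") points at the existing item.
    sense_link = Statement(
        subject=KEEGAN_LEXEME + "-S1",
        prop=ITEM_FOR_THIS_SENSE,
        value=KEEGAN_ITEM,
    )
    return [derivation, sense_link]
```

The point of the split is that the derivation lives between lexemes, while the link to the family-name item lives on the sense.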

Thanks so much for explaining fully; this makes sense to me as well. Looking at the discussion of derived from lexeme (P5191), I'm wondering why there is no constraint disallowing based on (P144) statements on Lexemes. I notice the See Also's, and derived from lexeme (P5191) is a way to sort of constrain a based on (P144) statement, but only for Lexemes, correct? Would such a constraint help avoid confusion for folks who might use based on (P144) on Lexemes instead of derived from lexeme (P5191)? Yes/No? --Thadguidry (talk) 13:06, 8 January 2021 (UTC)

@Thadguidry: based on (P144) is item-valued, derived from lexeme (P5191) is lexeme-valued, so they cannot substitute for one another. ArthurPSmith (talk) 18:32, 8 January 2021 (UTC)

@ArthurPSmith: Yes, I see that now, and I also see that the constraint item of property constraint (P2305) against Lexeme is applied. Two things are really puzzling: 1. The history entry for that does not appear here (I cannot tell who applied that constraint): https://www.wikidata.org/w/index.php?title=Property:P5191&offset=&limit=500&action=history and 2. The Data type = Lexeme is shown in a small box in the UI just above Statements, but I don't know how it got there for derived from lexeme (P5191). --Thadguidry (talk) 21:53, 8 January 2021 (UTC)

Data type is an intrinsic aspect of properties, and cannot be altered after the property is created. ArthurPSmith (talk) 15:30, 11 January 2021 (UTC)

Missing unicode symbols for proto-canaanite and proto-sinaitic

Hi, in my quest to create all the derivations from the Greek alpha all the way back to the hieroglyph 𓃾, I was surprised to find that Unicode symbols seem to be missing for the proto languages mentioned in the title. See https://en.wikipedia.org/wiki/Proto-Sinaitic_script https://unicode-table.com/en/search/?q=aleph and https://en.wiktionary.org/wiki/%F0%90%A4%80#Phoenician

Any ideas what to do about that?--So9q (talk) 07:53, 12 January 2021 (UTC)

Suggestion for new namespace: usage_example

In the Telegram chat https://t.me/joinchat/ICn09hkymb2dwpFKwGo5uA we recently discussed the need for storing translations of usage examples like Wiktionary does. @Nikki: and I found that it's hard or impossible to do in a linkable way in our lexeme namespace with our current data model (our examples are stored as values under usage example (P5831)), and it affects navigation of the lexemes negatively.

I therefore propose a new namespace U dedicated to usage examples and their translations.

A first idea was to create a multilanguagetext field with statements below it. That would enable us to annotate the original language using a new property "original language" and reference it. We should also add subject form (P5830) and subject sense (P6072) as statements on the usage example. (The translations are just that: they do not demonstrate verifiable usage in other languages, so they don't get to demonstrate anything among the lexemes.)

Knowing the original language, it is also possible to query in SPARQL for all the examples that originate in a certain language using that property, and to find their translations. We then link to the example, e.g. U1, from the Lexeme namespace.

Pinging WMDE people to hear their reaction to this @Lydia Pintscher (WMDE), Lucas Werkmeister (WMDE):

WDYT? – The preceding unsigned comment was added by So9q (talk • contribs) at 08:22, January 12, 2021‎ (UTC).
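
To make the proposal above a little more concrete, here is a minimal sketch of what such a usage-example entity might hold. It is plain Python rather than Wikibase or SPARQL, and everything in it is an assumption for illustration: the `UsageExample` class, its field names, the placeholder texts, and the form and sense IDs are hypothetical, since the U namespace is only a proposal.

```python
from dataclasses import dataclass


@dataclass
class UsageExample:
    """Hypothetical entity in the proposed U namespace."""
    uid: str                 # e.g. "U1"
    original_language: str   # value of the proposed "original language" property
    texts: dict              # language code -> text (the multilanguagetext field)
    subject_form: str        # value for subject form (P5830)
    subject_sense: str       # value for subject sense (P6072)

    def translations(self) -> dict:
        """All texts except the one in the original language."""
        return {
            lang: text
            for lang, text in self.texts.items()
            if lang != self.original_language
        }


def originating_in(examples, language):
    """Examples whose original language matches, as the proposed query would."""
    return [ex for ex in examples if ex.original_language == language]


# Placeholder data only; the texts and IDs are invented for illustration.
EXAMPLES = [
    UsageExample(
        "U1", "da",
        {"da": "<Danish example text>", "en": "<English translation>"},
        "L1-F1", "L1-S1",
    ),
    UsageExample(
        "U2", "sv",
        {"sv": "<Swedish example text>", "en": "<English translation>"},
        "L2-F1", "L2-S1",
    ),
]
```

Only the original-language text would count as attested usage; `translations()` separates out the rest, which matches the remark above that translations do not demonstrate usage in other languages.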

New dashboard for lexicographical data statistics

Hello all,

Since the middle of December, we have a new dashboard for statistics about lexicographical data on Wikidata: Wikidata Datamodel Lexemes. Similar to other dashboards (e.g. Wikidata Datamodel Statements), it collects various statistics (in this case, using a collection of SPARQL queries). The data is similar to some of the results gathered at Wikidata:Lexicographical data/Statistics, and shows the evolution of data over time.

This dashboard highlights the efforts of the editors to improve the content by adding new Lexemes, Senses and Forms (for example the recent import of Lexemes in Estonian), but also areas where improvements could be made (83% of Lexemes without Senses).

We hope you will find this useful. If you have any feedback or notice an issue, feel free to contact us on this page. Thanks, Lea Lacroix (WMDE) (talk) 12:42, 12 January 2021 (UTC)

Nice. It could be interesting to add the number of lexemes where a sense is available on items (lemma = label or alias of item). --- Jura 12:55, 12 January 2021 (UTC)
Also, forms with or without attested in (P5323) should be tracked. --- Jura 16:29, 14 January 2021 (UTC)

Problem with using latin as lexeme language

Hi, since yesterday I have worked on recording all derivations from Danish alfabet -> Latin variants -> Ancient Greek -> Phoenician -> ... I stumbled upon https://www.wikidata.org/wiki/Lexeme:L257129 and find it problematic, as it uses Latin (Q397) as language. Latin seems to have undergone a lot of evolution during its history, and I would prefer that we reflect that in our lexemes as well.

See Old Latin (Q12289) all the way to Contemporary Latin (Q1246397) to get an idea of how many variants of Latin have existed. Ping @uziel302:, who created the Latin letters.

WDYT?--So9q (talk) 07:46, 12 January 2021 (UTC)

So9q, every language has evolutions and it is hard to classify words by era, many words are common to multiple eras. You can see that I uploaded Whitaker's age categories and in the example you gave you can see Whitaker's Latin age type: late. Uziel302 (talk) 16:08, 16 January 2021 (UTC)

English noun ending with -s

here lists 541 English noun lemmas ending with -s; they are (indirectly) imported from WordNet. Some may be valid lemmas.--GZWDer (talk) 17:45, 15 January 2021 (UTC)

Some of these seem fine, except for the lack of forms or any other data on the lexemes. But for example surely airs (L316152) is just the plural of air (L1038)? Can there be any sense or etymology that is different from that for air (L1038)?? How should we fix these? ArthurPSmith (talk) 18:24, 15 January 2021 (UTC)

Pinging @Nikki: for further discussion. Note that wikt:airs has some senses other than plural; however, it may be debated whether a separate lexeme is needed.--GZWDer (talk)

Well, per wiktionary sense 2 on 'airs' is the same as sense 6 on 'air', which only states "usually in the plural", and I've certainly heard the phrase in singular. ArthurPSmith (talk) 19:19, 15 January 2021 (UTC)

They're your edits (except for 8 of them). It's not our job to tell you whether the lexemes you've added are valid and what they mean. If you don't even know whether they're valid or not, then I can only recommend that we delete the ones which are still empty. - Nikki (talk) 20:43, 18 January 2021 (UTC)

I'm in favor of deletion of empty lexemes, that is, lexemes that only have the lemma, language and lexical category. They are simply too vague to be of any value. I'm going to write a lexemecatcher that proposes unambiguous lexemes from speech and documents and guesses the forms based on lexeme forms for the user to approve. Then these empty ones will be a problem, because it's one more decision to make, whether to merge or not, which we want to avoid.--So9q (talk) 22:40, 18 January 2021 (UTC)
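
A minimal sketch of the "empty lexeme" test described above, in plain Python. It assumes the lexeme is available as a dictionary with `forms`, `senses` and `claims` keys, which is my understanding of the Wikibase lexeme JSON layout but should be checked against real data; the two sample records are hypothetical and were not fetched from Wikidata.

```python
def is_empty_lexeme(lexeme: dict) -> bool:
    """True if the lexeme has nothing beyond lemma, language and lexical category.

    Assumed layout: "forms" and "senses" are lists, "claims" is a mapping
    (or an empty list when there are no statements).
    """
    return not (
        lexeme.get("forms")
        or lexeme.get("senses")
        or lexeme.get("claims")
    )


# Hypothetical sample records, for illustration only.
EMPTY = {
    "id": "L-example-1",
    "lemmas": {"en": {"language": "en", "value": "examples"}},
    "forms": [],
    "senses": [],
    "claims": {},
}
WITH_FORM = {
    "id": "L-example-2",
    "lemmas": {"en": {"language": "en", "value": "example"}},
    "forms": [{"id": "L-example-2-F1"}],
    "senses": [],
    "claims": {},
}
```

With these records, `is_empty_lexeme(EMPTY)` is true and `is_empty_lexeme(WITH_FORM)` is false. A stricter check, for instance requiring at least one referenced form, would need to look inside the forms as well.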

Same here: some of these lexemes may be saved, but most should be deleted, turned into redirects or split into several lexemes. If someone wants to take a look and improve some of them, fine (not me, I already have a hard time fixing the empty lexemes for French), but otherwise it's indeed probably better to just delete them all.

@GZWDer: all of them are valid lemmas, but being a valid lemma is far from enough to be a valid lexeme (most lemmas are just forms, like plural forms, so they are part of a lexeme and not a lexeme in themselves; and the other way round, in some cases a lemma is shared by several lexemes). Why didn't you put in the very minimum of data (at least the source)?

Cheers, VIGNERON (talk) 12:59, 20 January 2021 (UTC)

Note I'm going to fix them though I am posting here for community inputs.--GZWDer (talk) 13:03, 20 January 2021 (UTC)

@GZWDer: Good, then we can wait before deletion (if deletion is ever needed). Some suggestions for a first fix: add at least one sense and one form (Lexemes should always have them) and merge lexemes when it's just a regular plural (pro-tip: to merge lexemes, you need to have the same main lemma first). Adding references would be needed too, and having other claims would be much appreciated. Cheers, VIGNERON (talk) 13:40, 20 January 2021 (UTC)

@Nikki, So9q, VIGNERON: There are 281 lexemes in total that may be used as the plural form of some noun, of which in the English Wiktionary:
1. 184 lexemes have a noun headword, so they may be valid lexemes in their own right
2. 44 lexemes do not have a lemma, but have some sense that is not used as a plural form: airs (L316152), auspices (L316560), bars (L316786), blahs (L317053), braces (L317310), briefs (L317391), buns (L317516), chilblains (L318004), cobblers (L318246), crabs (L318806), crossroads (L318905), diagnostics (L319411), diggings (L319450), follies (L320822), (L320998), funds (L321051), funnies (L321056), grits (L321482), guts (L321578), heaps (L321796), heights (L321832), hippies (L321924), hipsters (L321927), hoops (L322029), hysterics (L322216), liabilities (L323229), loins (L323389), megabucks (L323770), ninepins (L324477), oddments (L324682), pains (L324952), polls (L325670), prelims (L325858), prosthetics (L326033), respects (L326706), sands (L327110), services (L327532), shears (L327613), sights (L327771), snips (L328030), troops (L329644), viands (L330112), wages (L330233), waters (L330340)
3. 10 lexemes are used only as a plural form, but there is no Wikidata lexeme for the singular form of the word, so they are changed to the singular form
4. 43 lexemes are used only as a plural form, with a Wikidata lexeme in the singular form, so they may be deleted directly: (L316922), (L318178), (L318531), (L318709), (L319805), (L319867), (L319868), (L320415), (L320532), (L320708), (L321321), (L321486), (L321668), (L321918), (L322037), (L322780), (L322833), (L323269), (L323350), (L323421), (L323554), (L324071), (L324158), (L324409), (L324571), (L325572), (L325973), (L326191), (L326316), (L326584), (L326878), (L326976), (L327732), (L328147), (L328392), (L328418), (L329009), (L329015), (L329045), (L329192), (L329293), (L329579), (L330229)

Do you have any ideas about lexemes in group 2?--GZWDer (talk) 14:28, 20 January 2021 (UTC)

@GZWDer:

1. By « noun headword », do you mean « title of a Wiktionary entry »? If so, I need to remind you that most Wiktionary entries are not Lexemes; Wiktionary and Wikidata have two very different approaches to lexicography on that front (one Wiktionary entry usually corresponds to multiple lexemes, and at the same time several Wiktionary entries correspond to only one lexeme; for instance, en:wikt:tour corresponds to tour (L2330), tour (L2331), tour (L2332), tour (L6103), tour (L6104), tour (L42376), and meanwhile en:wikt:tour and en:wikt:tours both correspond to tour (L2330), or tour (L2331), etc.). Plus, Wiktionary is a tertiary source and not really a good reference (and didn't you say you used WordNet for the import?).
2. « have some sense »: I see no sense on these Lexemes, and most of them seem to be just regular plurals that should be merged, e.g. hipsters (L321927) into hipster (L321926).
3. I didn't understand... Do you have an example?
4. I'm not sure I fully understand, but looking at the example, it looks similar to #2 and it seems that they need to be merged, e.g. (L316922) into belonging (L316921).

Cheers, VIGNERON (talk) 15:14, 20 January 2021 (UTC)

@VIGNERON: Example:

1. wikt:mechanics is a plural of a word, and also a lemma by itself (i.e. it is in wikt:Category:English nouns, which is a category that only includes lemma forms). Both "mechanic" and "mechanics" would have a lexeme.
2. wikt:cobblers is not in wikt:Category:English nouns but has some senses by itself, so we need to discuss how to handle it.
3. wikt:alphanumerics is only defined as the plural of wikt:alphanumeric. Since Wikidata did not have a lexeme for "alphanumeric", we just need to change the lemma.
4. Similar to #3, but we now have a lexeme for wikt:eyes and another for wikt:eye. We need to either merge them, or delete the lexeme for "eyes".--GZWDer (talk) 16:58, 20 January 2021 (UTC)
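
The four cases above can be read as a small decision procedure. The sketch below is one hypothetical way to write it down in Python; the `Group` enum, the function name and the three boolean inputs are illustrative assumptions, not part of any existing script, and deciding those inputs (for example from Wiktionary categories) is exactly the part under discussion here.

```python
from enum import Enum


class Group(Enum):
    OWN_LEXEME = 1          # has a noun headword of its own (e.g. "mechanics")
    EXTRA_SENSE = 2         # no noun headword, but a non-plural sense ("cobblers")
    CHANGE_TO_SINGULAR = 3  # plural only, no singular lexeme yet ("alphanumerics")
    MERGE_OR_DELETE = 4     # plural only, singular lexeme exists ("eyes")


def classify_plural_lexeme(
    has_noun_headword: bool,
    has_non_plural_sense: bool,
    singular_lexeme_exists: bool,
) -> Group:
    """Assign a plural-looking lexeme to one of the four groups."""
    if has_noun_headword:
        return Group.OWN_LEXEME
    if has_non_plural_sense:
        return Group.EXTRA_SENSE
    if not singular_lexeme_exists:
        return Group.CHANGE_TO_SINGULAR
    return Group.MERGE_OR_DELETE
```

For example, `classify_plural_lexeme(False, False, True)` gives `Group.MERGE_OR_DELETE`, the "eyes" case. The order of the checks matters: a lexeme with its own headword is kept regardless of whether a singular lexeme exists.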

I checked more closely, but most of these are not lexemes; I'm a bit dubious about the Wiktionary categorisation here. Some are indeed pluralia tantum, but most are not. We probably need more sources to be sure (maybe by looking at and crossing data with en:wikt:Category:English pluralia tantum at the very least, ideally by finding other sources, as it's not unusual for a plurale tantum to have a singular anyway, despite the very definition of plurale tantum...). 3. I don't like repurposing entities, but for cases like alphanumeric (L316199), it's kind of OK. 4. Yes, merging is the best here, I guess. Cheers, VIGNERON (talk) 08:20, 22 January 2021 (UTC)

I had a look at cobbler and cobblers lexemes. We have for example these 3 ([1] [2] [3]) which are completely empty and created by you by bot I suppose (you did not add an edit summary it seems which is bad practice IMO). Please improve them with forms and senses and at least one reference on at least one form. If you are unsure about them, please add a deletion request yourself. I trust someone who has references comes by later and then they won't have to deal with this garbage.

If I don't see anyone improving all these empty lexemes created by your flood account before the end of February, I plan to go through them and add a deletion request for every single one or improve them myself. By then you will have had 5 months to improve them (you started in October).

Empty lexemes without references are not acceptable in WD IMO, if we create lexemes semi-automatically from now on I would like at least one form with a reference. I am planning to write a new script/tool to harvest lexemes and it will of course not allow users to add empty unreferenced lexemes because that's just bad for everyone and steals valuable time from others having to delete the mess.--So9q (talk) 14:02, 23 January 2021 (UTC)

@So9q: The lexemes are imported from WordData (see example), most of which ultimately come from WordNet (example). I am not able to add senses, as they are not under a free license. This also indicates that many "lemmas" in WordNet are not lemmas at all.--GZWDer (talk) 17:33, 23 January 2021 (UTC)

Note that WordNet does not contain "cobblers" as an interjection. It seems to come from the Collins English dictionary.--GZWDer (talk) 17:44, 23 January 2021 (UTC)

Wolfram WordData properties and statements on lexemes

Hi, I welcome you to join the discussion about WordData properties and statements on lexemes on this talk page: https://www.wikidata.org/wiki/Talk:Q105045991#I_suggest_moving_the_formatter_URL_to_a_relevant_property--So9q (talk) 15:51, 26 January 2021 (UTC)

The first bug triage hour will be about Lexicographical Data

Hello all,

As part of the improvements to the support process, we would like to find ways to better involve the community in the development of Wikidata. This is why, on top of the Wikidata and Wikibase office hour, we are experimenting with a new format: the Wikidata bug triage hour. It's an online event where Lydia, the product manager of Wikidata, publicly works on triaging development tasks (typically on Phabricator), improving their descriptions, defining their priority, and collecting the wishes and needs of the participants.

The first session will take place on February 16th and will be about Lexicographical Data. Join us and bring your favorite Phabricator task! You can find more details here; feel free to reach out to me if you have any questions. We're looking forward to chatting with some of you there!

Cheers, Lea Lacroix (WMDE) (talk) 10:07, 28 January 2021 (UTC)