Wikidata talk:Lexicographical data/Archive/2020/05

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

Abbreviation as separate lexeme or not?

See kilomètre (L19811) (has "km" as form). --So9q (talk) 10:19, 28 April 2020 (UTC)

  • In some languages (pl?), they are separate lexemes to allow addition of the usual 100? forms present on any lexeme. --- Jura 10:36, 28 April 2020 (UTC)
I think this would strongly depend on the language. When I thought about how I would deal with abbreviations in Czech, I did not reach any final conclusion. Generally I would follow some self-imposed rules with reasoned exceptions: 1) Abbreviations that act and are read like any other word (=acronyms?) should be separate lexemes; 2) abbreviations that are meant for writing only and are usually read unabbreviated (ex.; b.; etc.) should not be considered separate lexemes; IMO some kind of (new?) property would be best for them, but listing them as forms might work as well; 3) Symbols should not be considered separate lexemes, and some kind of (new?) property would be best for them (and I would not list them as forms). --Lexicolover (talk) 20:57, 1 May 2020 (UTC)

CEFR language competence level for lexeme

Hi! For many languages, there are published so-called "word lists" that a person whose competence corresponds to a particular level (say, A2) is expected to know. I think the community would benefit a lot if those lists could be imported into Wikidata Lexemes as well, so that on the lexeme page we could see which CEFR (or any other scale) level the word corresponds to. But I could not find any suitable property. What do we think about this? Could a property be added, and what should the ontology be?  – The preceding unsigned comment was added by 62mkv (talk • contribs).

I have been wondering how best to mark that a word appears in the Goethe Institute's word lists for German. I think rather than a specific property for CEFR or level, I would use something more general, such as "on word list", with a link to an item like "Goethe Institute B1 word list" which would have information about who published it, when and where. That way we could link to non-CEFR word lists too. - Nikki (talk) 10:11, 5 May 2020 (UTC)
Sounds good to me.--So9q (talk) 16:15, 8 May 2020 (UTC)

Described by Wiktionaries

Hello,

It seems complicated to use wikibase:sitelinks to a Wiktionary, so I suggest using P1343 to indicate when a form is described in a Wiktionary and when a sense is described. I made a test to say that the sequence of letters chat is described in French Wiktionary, and that the meaning associated with a pet is also described in French Wiktionary. Do you think this is acceptable? Another strategy may be to create a series of properties similar to P7829 (for Wiktionaries instead of Vikidia) to add links to the dedicated pages. Noé (talk) 07:52, 18 April 2020 (UTC)

@Noé: This seems reasonable to me, but it would be nice to also include a link to the specific Wiktionary page referred to, perhaps with a reference URL (P854) reference statement, or maybe there's a qualifier that would work? ArthurPSmith (talk) 15:16, 20 April 2020 (UTC)
I don't have any preference, but I'll be pleased to also include links to the Wiktionary pages and to the meanings (with an anchor to a definition when available). If a Sense is also connected to a Qid with a Wikipedia page via P5137, we may be able to have one or more links to Wiktionary in the side menu on the Wikipedia page, next to the other projects. That may definitively prove that Lexeme is fulfilling the initial aim of Wikidata:Wiktionary, helping Wiktionaries. Noé (talk) 15:34, 20 April 2020 (UTC)
I love the idea of linking Lexemes to Wiktionary (this was indeed one of the original goals after all), and vice versa (that's exactly why I built the gadget mentioned here: #Gadget to link Wiktionary and Wikidata). This is definitely an idea worth looking into more deeply and doing right. So for me, it's a big YES, this is acceptable and even desirable!
In practice, I'm not sure what the best way is: it could be described by source (P1343) (alone or with a qualifier, e.g. reference URL (P854)), it could be a linking property, or anything else. I'm also wondering what the best place is: described by source (P1343) is often used at the main Lexeme level (again alone or with qualifiers, for instance subject form (P5830)); in the test, it is on both sense and form. It makes sense, but I'm not sure it's really optimal.
If we put it at the form level and/or indicate the form as a qualifier, then there is no need to explicitly store the link, as the link is trivial to build (it could probably even be done by JavaScript inside the Wikidata interface itself), and even without this precision, most of the time the main lemma can be used by default. In fact, it seems to me a more reliable and elegant way to do it than using reference URL (P854), and even more so than a solution like English Vikidia ID (P7829). Caveat: it may depend both on the granularity wished for (linking to the page itself or to an anchor in the page) and on the Wiktionary targeted, as each Wiktionary has a different structure, but on the other hand, it allows reusers to build the link as they wish.
Cdlt, VIGNERON (talk) 16:59, 20 April 2020 (UTC)
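To illustrate how trivial building the link from a lemma or form can be, here is a minimal Python sketch. The function name `wiktionary_url` and its parameters are hypothetical, and it assumes the usual `https://<lang>.wiktionary.org/wiki/<title>` page layout; the anchor is optional because each Wiktionary structures its pages differently.

```python
from typing import Optional
from urllib.parse import quote


def wiktionary_url(lemma: str, wiki_lang: str = "fr", anchor: Optional[str] = None) -> str:
    """Build a Wiktionary page URL from a lemma or form (hypothetical helper).

    Assumes the standard https://<lang>.wiktionary.org/wiki/<title> layout.
    """
    # MediaWiki page titles use underscores in place of spaces.
    title = lemma.replace(" ", "_")
    url = f"https://{wiki_lang}.wiktionary.org/wiki/{quote(title)}"
    if anchor:
        # The anchor naming scheme differs per Wiktionary; the caller chooses it.
        url += "#" + quote(anchor)
    return url


print(wiktionary_url("chat"))
print(wiktionary_url("kilom\u00e8tre"))
```

A reuser who wants finer granularity could pass an anchor, but which anchors exist depends on the targeted Wiktionary.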
Hey, thanks to VIGNERON's live contribution yesterday, I was able to write a query to get all French entries that have a connection to a Qid associated with a Wikipedia page: https://w.wiki/Qhc
I am quite happy to see 500+ pages. I am wondering now if it could be a good sample for a proof of concept of the idea set out above. So, it would be great if someone had an idea of how to check whether those pages exist in French Wiktionary and then automatically add this statement to those entries. Then, when the last column of the query is filled in (for now it is only filled for "chat", you can search for it in the results), it would be possible to develop a gadget to indicate the related Wikipedia page in Wiktionary, and the related French Wiktionary pages in Wikipedia. It could be nice! Noé (talk) 08:45, 13 May 2020 (UTC)
@Noé: the gadget I mentioned earlier, #Gadget to link Wiktionary and Wikidata, could be adapted for your purpose. You should contact @Darmo117:, who helped me build this gadget, for the tech part, and obviously I can help with the SPARQL part. Cheers, VIGNERON (talk) 09:28, 13 May 2020 (UTC)
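One possible way to check whether pages exist on French Wiktionary is the MediaWiki `action=query` API, which flags non-existent titles as missing. The sketch below only covers the parsing step: `existing_titles` is a hypothetical helper, and `sample_response` is a hand-written stand-in that assumes the `formatversion=2` JSON shape (pages as a list, with `"missing": true` on absent pages). It is not output captured from the live API.

```python
from typing import Set


def existing_titles(api_response: dict) -> Set[str]:
    """Return the titles that exist, given an action=query JSON response.

    Assumes formatversion=2, where "pages" is a list and absent pages
    carry a truthy "missing" flag (invalid titles carry "invalid").
    """
    pages = api_response.get("query", {}).get("pages", [])
    return {
        page["title"]
        for page in pages
        if not page.get("missing") and not page.get("invalid")
    }


# Hand-written stand-in for a response to a request such as
# action=query&titles=chat|zzzz-not-a-word&format=json&formatversion=2
sample_response = {
    "query": {
        "pages": [
            {"pageid": 1, "ns": 0, "title": "chat"},
            {"ns": 0, "title": "zzzz-not-a-word", "missing": True},
        ]
    }
}

print(existing_titles(sample_response))
```

The lemmas returned by the SPARQL query could be sent in batches and only the surviving titles used when adding statements.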

How broad should the senses reach?

Hi, I stumbled upon this lexeme today https://www.wikidata.org/wiki/Lexeme:L58286. It has senses not only covering the sense in the language of the lexeme but the similar concept in other countries. I imagine we could continue down this road and add senses for the similar concept in all countries that have it. But is that really what we want?--So9q (talk) 03:59, 9 May 2020 (UTC)

  • @So9q: I have been generally NOT adding country-specific senses even where Wikidata has items - military ranks are a very common case, for example lieutenant has both general items and specific items for "British Army and Royal Marines", "french military", "Canadian Armed Forces", "Royal Navy", "Starfleet" etc. I would advocate for only adding the general meanings as senses, the specific ones really don't add anything significant to the meaning. ArthurPSmith (talk) 14:38, 11 May 2020 (UTC)
  • @So9q: When adding senses to Russian lexemes I run into the same problem (especially with military ranks too). I tend to do the same way as ArthurPSmith. --Infovarius (talk) 02:57, 15 May 2020 (UTC)
  • Sorry to break the consensus, but I disagree a bit. For me, specific senses seem both useful and necessary in a lot of cases. Despite having the same name and etymology, some senses can cover very different realities (maybe they should even be separate Lexemes? probably not, but I wonder...). For instance canton (L18778): in Switzerland a "canton" is a very big administrative unit (similar to a region or a state in the United States) while in France, a "canton" is a very small administrative unit (similar to a county or more often to a quarter of a city). That said, the structure in place on socken (L58286) is maybe not the best (and by the way, I'm notifying @Vesihiisi:, who is the best person to talk about this lexeme she created) and maybe we can find a better one, as these are indeed not truly separate senses but rather "derived" senses. Cheers, VIGNERON (talk) 14:42, 15 May 2020 (UTC)

Wikidata:WikiProject Lexicographical Data ?

Hello,

Is there a team of lexicographers hidden somewhere? Have the people adding lexicographical data already gathered around a place where initiatives and personal projects are discussed? And finally, is there a logo for this team/group/bunch of people or for Lexicolovers in general? I was looking for a userbox icon showing an interest in Lexeme data but I haven't found one. Is it really too early to create a team spirit here? Noé (talk) 06:34, 16 May 2020 (UTC)

@Noé: the « team » is hidden here in plain sight ;)
Yes, again it's here. Welcome!
Not that I know of, but feel free to propose one!
This is the second time (at least, after #User box for the Lexicographical data project ?) it has been asked, so I created one: {{User LexData}} (with the glyph of ama/𒂼 (L1) as the image in the meantime; can someone activate the translation tag?). Team spirit exists without a symbol, but symbols are indeed useful for team spirit.
Cheers, VIGNERON (talk) 10:05, 16 May 2020 (UTC)
So, if it's a WikiProject, I recategorized it into the category for WikiProjects. For the team, there is no list of participants like for other projects, but that is maybe more of a Wikipedia habit than a Wikidata one. Great for the userbox! For the logo, I thought of an L made of Wikidata lines, with the same colors, but it could be hard to read at a small size. The ama sign is pretty! Noé (talk) 10:29, 16 May 2020 (UTC)

Some help to use LexData

Hi, I'm trying to use LexData to create lexemes, but I get a message like: INFO:root:Maxlag hit, waiting for 5.0 seconds. If the lexeme already exists, everything is okay, but when the lexeme doesn't exist, I get this message. What am I supposed to do? Thanks! Lepticed7 (talk) 08:58, 18 May 2020 (UTC)

@Lepticed7: You are very likely not doing anything wrong; Wikidata disallows "bot" edits when the "maxlag" value is too high, due to too many backlogged edits that need to be processed. This happens quite often, I suggest you just wait a few minutes and try again (maybe several times). There are also grafana charts that can show you the current maxlag value so you know when it won't work, this one in particular. ArthurPSmith (talk) 17:39, 19 May 2020 (UTC)
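The "wait and try again" advice can be sketched as a small retry loop. This is only an illustration under stated assumptions: `MaxlagError` and `with_maxlag_retry` are hypothetical names, not part of LexData or of any Wikidata client, and `action` stands for whatever call performs the edit.

```python
import time


class MaxlagError(Exception):
    """Hypothetical exception standing in for a 'maxlag' API error."""

    def __init__(self, retry_after: float = 5.0):
        super().__init__("maxlag")
        self.retry_after = retry_after


def with_maxlag_retry(action, max_attempts: int = 5, sleep=time.sleep):
    """Call action(); on MaxlagError, wait and retry up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except MaxlagError as err:
            if attempt == max_attempts:
                # Give up and let the caller decide (e.g. retry later).
                raise
            # Wait for the delay the server suggested before trying again.
            sleep(err.retry_after)
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without real waiting; a real script would simply use the default.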

Please double check this lexeme

Hi, please check låne (L300647) and tell me if it is wrong. I'm new at this. Iwan.Aucamp (talk) 00:29, 20 May 2020 (UTC)

@Jon Harald Søby: could you take a look? Cheers, VIGNERON (talk) 10:28, 20 May 2020 (UTC)
@Iwan.Aucamp: It looks alright to me, except I don't understand why there are grammatical gender (P5185) and requires grammatical feature (P5713) statements on the forms, that seems redundant to the grammatical features listed. Also the S2 is – AFAIK – only in the phrase "låne tid", not for the base form "låne". Jon Harald Søby (talk) 14:26, 20 May 2020 (UTC)
@Jon Harald Søby: Thanks for the feedback, I adjusted it and removed the sense. I'm not very familiar with best practices and just trying to get the feel for it so the input is much appreciated. Iwan.Aucamp (talk) 14:53, 20 May 2020 (UTC)

Model lexemes and language communities

Is there a place where we can define model lexemes for languages? Maybe a new property is needed for that similar to model item (P5869)? I think if we have some model lexemes for each language it will make it easier to manage.

Also is there some approach for community coordination around specific languages similar to wikiprojects?

Iwan.Aucamp (talk) 14:54, 20 May 2020 (UTC)

@Iwan.Aucamp: some people started pages for specific languages, listed on Wikidata:Lexicographical data/Documentation/Languages. Some of these pages are not bad (I worked a lot on Breton: Wikidata:Lexicographical data/Documentation/Languages/br) but most are still stubs with only basic information. Feel free to create one. Also, to all lexical lovers, it would be nice to have feedback to improve them and to have a coherent structure (not to be strongly enforced, but as a suggestion most sections could be similar). Any comments are welcome on Wikidata talk:Lexicographical data/Documentation/Languages. Cheers, VIGNERON (talk) 15:56, 20 May 2020 (UTC)

User box for the Lexicographical data project ?

Hi,

The title is the question: is there already a user box? If not, could we create one? --Hsarrazin (talk) 12:55, 1 April 2020 (UTC)

I don't think there is one yet. Feel free to create one :) Lea Lacroix (WMDE) (talk) 12:27, 3 April 2020 (UTC)
  Done {{User LexData}} (see also #Wikidata:WikiProject Lexicographical Data ?). Cheers, VIGNERON (talk) 09:28, 24 May 2020 (UTC)

Gender in French

@Lepticed7: Does Lexeme:L241 really have 2 different genders? I'd propose to separate this into 2 lexemes: "chien" and "chienne" with definite genders. --Infovarius (talk) 00:20, 17 May 2020 (UTC)

@Infovarius: Hi! The fact is that in French, two things characterize nouns: gender and number. For gender, we have both masculine and feminine, and (almost?) every noun in French is either masculine, feminine, or sometimes both (it can change depending on multiple factors; one word in this case is « chips »). We even have nouns that are masculine when singular, but feminine when plural (like « amour » (love)). For the classic pet animals (dog, cat) or for the farm animals (cow, pig, chicken, etc.), we have a version of the word to identify male animals (« chien », « chat », « canard », « verrat »), and these words are masculine; and we have the feminine words to identify female animals (« chienne », « chatte », « cane », « truie »). Some are "just" inflections (is this the right word?) using suffixes, generally « -e », and some are not (like « verrat » and « truie »). It would be great to have more points of view on this topic, but because gender and number characterize a form, and not a lexeme, and words in French (nouns or adjectives) are presented by giving this pair, I think we should not present this information on the lexeme, but on the forms. And we should not separate these two pieces of information. Lepticed7 (talk) 02:04, 17 May 2020 (UTC)
flexion in French is inflection in English, désinence is verbal inflection. It sounds like a heavy metal band to me! Noé (talk) 08:55, 17 May 2020 (UTC)
My opinion is that "verrat"/"truie" are not different forms of one word but are different words (linked by some relation). We have the same in Russian: кабан/свинья. Nouns are not inflected by gender (like adjectives)! They have gender! --Infovarius (talk) 23:52, 18 May 2020 (UTC)
It's okay for me this way. I made the change in the first place because the lexeme presented both the masculine and the feminine gender. But if, for nouns, we make separate lexemes for genders, I agree :D Lepticed7 (talk) 07:42, 19 May 2020 (UTC)
Whether or not to separate lexemes based on gender has been an open question since the beginning of Lexemes (and even before, as it was also an example on the test platform), q.v. Wikidata_talk:Lexicographical_data/Archive/2019/05#Lexemes_and_gender_of_noun for instance (where I list some cases).
I don't have a strong opinion, but I must say I'm not really convinced by the "two lexemes based on gender" solution by default. "chien" and "chienne" are the same lexeme: same lexical category, same etymology, morphology, almost same everything, except for gender. @Infovarius: « Nouns are not inflected by genders », really? Cases like chien/chienne (in French), perro/perra (in Spanish) or Lehrer/Lehrerin (in German) are clearly inflections (and this is in fact the most common case, most nouns have forms depending on gender); this is also not what the sources seem to say on en:Grammatical gender, or look for "gender inflection" on Google Books, which gives many results. « They have gender! » Yes, but this gender is not always unique or even existent: they can have 0, 1, 2, 3 or more genders; having one gender is maybe the most common case, but there are a lot of exceptions (especially if you consider diachronic or dialectal data).
In the end, the situation is *very* complicated. Maybe two lexemes can help to be more precise, but it is also more complicated and raises many more new questions. Should we create a duplicate for each gender even when the gender is unmarked, like ministre (L19816)? And what about suppletion (Q324982) (when the inflected forms are not related, like "verrat"/"truie" for gender, but there is also the same phenomenon for number, like "ki"/"chas" ki (L69) in Breton)?
Cheers, VIGNERON (talk) 11:09, 20 May 2020 (UTC)
True, but gender behaves more or less the same in most languages and no language has ever had a "one lexeme has only one gender" iron law; there are always exceptions. At least for French, many words don't have only one gender, so we should talk in depth about how to deal with that. To start, here is the query for the (currently) 43 Lexemes with both masculine and feminine gender.
I didn't notice that you created this table {{Single or multiple lexeme}}. It seems interesting but I'm a bit confused: why is it in the data model? Was it announced somewhere? Where does it come from? (You put "from talk archive" in the edit summary but it's very vague: which talk archive?) And how should it be read?
Cdlt, VIGNERON (talk) 18:12, 22 May 2020 (UTC)
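The query linked in the comment above is not reproduced in this archive. As a rough reconstruction only, a query of that kind could be assembled as below. The helper `both_genders_query` is hypothetical, and the item IDs Q150 (French), Q499327 (masculine) and Q1775415 (feminine) are assumptions to verify before use; grammatical gender (P5185) is the property named elsewhere on this page.

```python
from typing import List


def both_genders_query(language_qid: str, gender_qids: List[str]) -> str:
    """Build a SPARQL query for lexemes carrying every listed gender (sketch).

    Relies on the prefixes predefined by the Wikidata Query Service
    (wd:, wdt:, dct:, wikibase:).
    """
    # "a , b" in SPARQL means the subject must have both values.
    genders = " , ".join(f"wd:{qid}" for qid in gender_qids)
    return (
        "SELECT ?lexeme ?lemma WHERE {\n"
        f"  ?lexeme dct:language wd:{language_qid} ;\n"
        "          wikibase:lemma ?lemma ;\n"
        f"          wdt:P5185 {genders} .\n"
        "}"
    )


# Assumed item IDs (verify on Wikidata before running the query).
FRENCH = "Q150"
MASCULINE = "Q499327"
FEMININE = "Q1775415"

query = both_genders_query(FRENCH, [MASCULINE, FEMININE])
print(query)
```

The same helper could be pointed at another language item to compare how common multi-gender lexemes are elsewhere.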

Modeling etymologies

Hello! We now have a lot of lexemes in many languages, including Latin, and this helps us build etymologies. I have been talking about this with a friend of mine who works in this area and he has given me some advice on how to proceed, but I would need your opinion on how to model it.

The derived from lexeme (P5191) logic would be this:

But he proposes to use something like this:

1 This word is still in use in some Basque dialects

How could we model the cf. (Q1048501) items? Is that a property? Or do we have a model for that?

@VIGNERON, Uziel302: -Theklan (talk) 09:06, 22 May 2020 (UTC)

@Theklan: very interesting suggestion (etymology is indeed more than just a straight line).
For « eleiza1 », I would simply put it as a form.
Indeed a "confer" property could be useful, maybe we could simply use an existing property (is there a fitting one? not related property (P1659) only for property) but a new property may be cleaner. And we use it as qualifier of derived from lexeme (P5191) or as direct property?
Cdlt, VIGNERON (talk) 09:52, 22 May 2020 (UTC)
@VIGNERON: The property "confer" is problematic, because... should it be another lexeme, or could be any string? -Theklan (talk) 09:59, 22 May 2020 (UTC)
@Theklan: definitely not a string, but it could be either a lexeme or a form of a lexeme. I would lean toward the first, a lexeme: in most cases it's precise enough, and then you can use the qualifier subject form (P5830) (which means that in this case, "confer" must be a direct property and not a qualifier, since qualifiers cannot get qualifiers themselves); plus a lexeme can have (and often has) multiple homographic forms, so again in this configuration, a Lexeme is better.
Here is a suggestion for the real and simple example of bara (L2283):
  confer: bara (L2284)

As for the name, "confer" is probably a bit too general; maybe "cognate" would be a better name for this (but maybe it's too narrow and too pedantic, and not exactly what you mean…), or maybe "related lexeme" (or as an alias?).
Cheers, VIGNERON (talk) 10:29, 22 May 2020 (UTC)
I'm not an etymologist, but I think that confer is a way to compare something. If I say that eleiza (L300882) ↔ *egleisa (old Occitan) I can add there as qualifier "confer" Old Gascon gléisa.
  derived from lexeme: *egleisa
    confer: gléisa (Old Gascon)
-Theklan (talk) 10:41, 22 May 2020 (UTC)

I have shown him this conversation and we propose that both cognate and confer should be created. In most cases the two may be interchangeable (gléisa is a cognate of eliza and both come from *egleisa), but in some cases the confer property may demonstrate how a word can change by analogy, and wouldn't be related to the word itself. With confer you can show a well-attested vowel change that should be noted for a less well-attested change in another word, as a process. Cognate would then be the one used in most cases. -Theklan (talk) 11:35, 22 May 2020 (UTC)

@VIGNERON: There is this proposal by Fnielsen. Theklan (talk) 13:04, 22 May 2020 (UTC)
Here is an example of what could be done: https://lucaswerkmeister.github.io/wikidata-lexeme-graph-builder/?subjects=L49331&predicates=P5191 -Theklan (talk) 17:41, 22 May 2020 (UTC)
Hello! "Confer" or "derived" is not good IMHO. I have been doing etymology on French Wiktionary for a long time, and if you want to properly link two lexemes with an etymological link, you need to explain four angles of analysis: morphology, historical phonetics, historical semantics and contextual analysis (like the history of the people who created the word). If you link a word with an etymon, you need to explain why, in the relation. And note that a great many words can't simply be linked to another (especially because of the etymological structures of the lexicon). "Derived" is a specific linguistic term in etymology, used only in certain cases. Lyokoï (talk) 11:35, 24 May 2020 (UTC)
@Theklan: if tools (like the Wikidata lexeme graph builder, SPARQL queries and many others) can already give you the cognates, why store them explicitly? That said, a "confer" (or whatever name) as a qualifier (your solution above is indeed better than mine) can be useful to point to a specific cognate relevant for the etymology.
@Lyokoï: very interesting, could you tell us more? How would you model that? (With direct properties or qualifiers? I imagine the latter; and how does it compare to the current model, where we already have two properties for etymology and morphology?) And ideally, do you have any references about that?
Cdlt, VIGNERON (talk) 12:54, 24 May 2020 (UTC)
My 2 cents: Wikidata is a secondary database. I uploaded Latin here based on Whitaker's WORDS and I think any relation between words should be based on sources. If we have a source claiming that one word is derived from another word, we should be able to link those forms, not just the lexemes. We can have multiple sources offering different etymologies. We shouldn't force any systematic relations between words unless there is a consensus in the academic sources. Let alone guessing etymologies based on forms and sounds: it's very easy to end up with a false etymology (Q17013103). Uziel302 (talk) 18:50, 24 May 2020 (UTC)
Some terms in this proposal are reconstructed words (Q55074511) and based on prior discussions, I feel it is still unclear if those are Lexeme or not and how to source them. Is it the right time to restart this conversation? Noé (talk) 10:13, 25 May 2020 (UTC)


I think Wikidata would greatly benefit if we had clear vision how it should look in the end. At this point we rely on derived from lexeme (P5191) and combines lexemes (P5238) (and it is not always done correctly). Confer property might give us some interesting views on etymologic data. But there are other issues we should deal with. For example my dictionary is full of entries saying something like "origin is unclear, maybe it has something with X, there are opinions it is unlikely" or "probably onomatopoetic origin". Sometimes we can't be sure what the most direct predecessor is (for example whether the relation between lexemes A, B, C is A → B → C or B ← A → C or A → C → B, ...). @Theklan: Do you think you could along with your friend come up with some best practices for common etymology issues on Wikidata? --Lexicolover (talk) 10:19, 26 May 2020 (UTC)
@Lexicolover: I don't think we can model something universal, and that's why we need to discuss good practice. In Basque, etymologies are very unclear, so only really clear etymologies are added. I would go with that, and that's why confer may give the reader interesting information on how the etymology has been worked out. This is a good example of what can be done: https://lucaswerkmeister.github.io/wikidata-lexeme-graph-builder/?subjects=L49331&predicates=P5191 .
@Noé: I think we need to restart that conversation, if it was closed. -Theklan (talk) 11:05, 26 May 2020 (UTC)
@Noé: yes, we absolutely need to (re-)start this discussion. A bot request is not really a good place for a discussion; there were many different aspects, and from what I understand the core problem was the lack of references more than (or instead of) the reconstructed lexemes themselves. Cheers, VIGNERON (talk) 12:53, 28 May 2020 (UTC)
@Lexicolover: yes, we need a clearer vision both of etymology in general (Theklan and his friend can help there) and of how to use the existing properties (here it is a job for us on the Lexemes side). For the other issues you raised, can't they simply be dealt with using qualifiers? sourcing circumstances (P1480) was created exactly for the purpose of saying things like unlikely, probably, maybe, etc. Cdlt, VIGNERON (talk) 13:08, 28 May 2020 (UTC)

Notability of languages

Have we any policy about notability of languages? Look e.g. Urgesal (Q63449899). Or I can invent my language and add Lexemes in it? --Infovarius (talk) 18:42, 29 May 2020 (UTC)

There is Wikidata:Lexicographical_data/Notability#Languages. Pamputt (talk) 21:40, 29 May 2020 (UTC)
Urgesal (Q63449899) has a wikilink, so it's notable enough per Wikidata general rules (WD:N). For the notability of Lexemes, we need to make this draft page a rule (and maybe talk about it before validating it; I think it's mostly good but at least the introduction needs some rework). Cheers, VIGNERON (talk) 08:55, 30 May 2020 (UTC)
If a language is notable when it has its own Q-item, and a Q-item is acceptable for whatever is there on Wikiversity, then we have a problem. The Wikiversity page states clearly "The aim of the project is to create a language, ...". I don't think Wikidata should serve as a platform for one-man shows. --Lexicolover (talk) 14:47, 31 May 2020 (UTC)