Wikidata talk:Lexicographical data/Notability

Dialect

When I read "Note community may explicitly include or exclude a language base on consensus, such as when a proposed language is considered a dialect of or alternate name for another language", I understand that dialects would be excluded from the "languages" that can be used. I think this is a mistake. Indeed, if a contributor knows which dialect a given word belongs to, we should indicate it. The information on the main language is still present, because the dialect is a subclass of the language. For example, on the French Wiktionary project, we have several categories for the various dialects of the Romansh language (see Catégorie:Dialecte vallader en romanche for example). If I misunderstood, could you clarify? Pamputt (talk) 22:12, 7 August 2017 (UTC)

@Pamputt: For example, the community can discuss whether to have individual Serbian and Croatian entries, or to have entries applicable to both variants.--GZWDer (talk) 05:47, 31 August 2017 (UTC)
@GZWDer: good example. In that case, I think that Serbian (Q9299), Croatian (Q6654), Montenegrin (Q8821), Bosnian (Q9303) and Serbo-Croatian (Q9301) must all be available. The reason is that some users would like to contribute in one of these "languages" and not in the others. Allowing all languages and dialects spares us some debates that are both linguistic and political. Since items for "related" languages can be linked together, it should be quite easy to extract all information for Serbo-Croatian (Q9301) using information from the four other "languages". Pamputt (talk) 05:58, 31 August 2017 (UTC)
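
The extraction mentioned in the last comment could look roughly like the following. This is only a minimal sketch: the item IDs are the ones quoted above, but the `(lemma, language_qid)` input format and the function name `merge_for_serbo_croatian` are assumptions made for illustration, not an existing Wikidata export format or API.

```python
from collections import defaultdict

# Item IDs as quoted in the discussion above.
SERBO_CROATIAN = "Q9301"
VARIETIES = {
    "Q9299": "Serbian",
    "Q6654": "Croatian",
    "Q8821": "Montenegrin",
    "Q9303": "Bosnian",
}


def merge_for_serbo_croatian(entries):
    """Group lemmas recorded under Serbo-Croatian or any of the four varieties.

    `entries` is a hypothetical iterable of (lemma, language_qid) pairs.
    Returns a dict mapping each lemma to the sorted list of language items
    under which it was recorded.
    """
    merged = defaultdict(set)
    for lemma, language_qid in entries:
        if language_qid in VARIETIES or language_qid == SERBO_CROATIAN:
            merged[lemma].add(language_qid)
    # Sort lemmas and IDs so the result is stable and easy to read.
    return {lemma: sorted(merged[lemma]) for lemma in sorted(merged)}


sample = [
    ("mlijeko", "Q6654"),
    ("mleko", "Q9299"),
    ("mlijeko", "Q9303"),
    ("lait", "Q150"),  # French: not one of the five items, so it is skipped
]
print(merge_for_serbo_croatian(sample))
```

Keeping the original language item next to each lemma means nothing is lost by allowing all five "languages": the merged view is derived, and the per-variety data stays as entered.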

Licenses (initially titled Reconstructed languages)

Hello,

Protolanguages are works made by researchers and published in databases, articles or books. They can be protected by law, through authors' or editors' rights. Is it possible to consult a lawyer to get expert advice on how we can use this kind of information? Thank you in advance for giving this issue proper attention. Noé (talk) 14:59, 17 August 2017 (UTC)

More generally, I'm curious to know which license is intended for Wiktionary-related items within Wikidata. If we want to take advantage of existing Wiktionary material, we obviously cannot go with CC-0. Is it possible to have a license other than CC-0 in a specific namespace of Wikidata? Could any of @Christopher Johnson (WMDE):, @John Erling Blad (WMDE):, @Lydia Pintscher (WMDE): confirm or rule out this possibility? --Psychoslave (talk) 07:26, 25 August 2017 (UTC)
@Noé, Psychoslave: I instead think much data, such as word lists, inflected forms and pronunciations (as individual pieces of information, not as a dataset), is uncopyrightable as it doesn't involve any originality. If we insist that we don't want any data derived from copyrighted data, I think more than 99% of Wikidata data will be deleted.--GZWDer (talk) 05:44, 31 August 2017 (UTC)
I do not want to discuss this issue here, as we already started there. Can we get back to the specific issue of protolanguages? See also English Wiktionary policy and French Wiktionary policy. -- Noé (talk) 07:19, 31 August 2017 (UTC)
Thank you for all the links @Noé:. Well, I surely can't help more with the issue there; a lawyer would probably be more relevant, as you said. --Psychoslave (talk) 11:30, 31 August 2017 (UTC)

Actually, Rua wants to add reconstructed forms that do not come from "academic works". So I think a conclusion should be reached before any massive import. Pamputt (talk) 15:25, 27 September 2018 (UTC)

Attestation : serious source

Hi again,

Please define 'serious source'. Is a video recording of an endangered language spoken by two old drunk women considered a serious source? Is an academic policy without any consequence for daily life a serious source? The second criterion, which assumes that a word exists if it is in use (someone uses it and another person uses it again with the same meaning), seems more convincing to me. Noé (talk) 15:05, 17 August 2017 (UTC)

@Noé: The intention of this rule is to allow some well-discussed hapax legomenon (Q168417) (especially in languages with limited documentation) without being spammed by user-coined terms.--GZWDer (talk) 05:33, 31 August 2017 (UTC)
I understand the intention, and we have had the same one in Wiktionaries for ages, but we have to be very careful about the way we frame a rule on sources. Wiktionaries are not fundamentally based on corpora, but they tend to integrate more quotations as they grow in quality. So, at some point, the corpus in use is defined as all existing productions without any hierarchy. Books are used, but also articles, lyrics, recipes, manuals, blogs, etc. Plus, we also consider three kinds of sources (not primary and secondary as Wikipedia does): attestations of usage (the kind under debate here), audio recordings attesting the pronunciation (and there are no serious sources here) and references for etymology (working almost like sources in Wikipedia, with neutrality rules). I don't think you want to put a rule on what is a serious source for a reference, so it may be good to specify that you exclude audio recordings and references in your rule. By the way, English is not my mother tongue, so I will not be offended if you ask me to rephrase something obscure or too vague! Noé (talk) 07:10, 31 August 2017 (UTC)

Taking a diametrically opposed approach

I would propose to take a really different approach, which would be to focus more on glyph sequences. For any glyph sequence (you might think "character" instead of "glyph"; "glyph" is just more generic, and will also include w:SignWriting for example), we can always easily provide statements on the fly. That is, there should be no input glyph string for which only a laconic "not found" statement is returned, as is currently the case: 'There were no results matching the query. You may create a new item for "rsautiescébotbé".' The following is a non-exhaustive list of what could be proposed (possibly "folded by default") as automatically inferable:

  • anagrams, even meaningless ones, but with a filter option to toggle the visibility of those unattached to any definition
  • how many times the exact string was queried
  • lengths, in terms of bits, bytes and glyphs for example
  • nearby strings, using metrics like w:Levenshtein distance
  • substrings
    • morphs and morphemes

Those things are statements which can be provided directly from the string itself (except for relations with "meaningful" terms), and would not actually require a gigantic useless string repository. Only the number of times the string was queried really needs to be stored (and some inquiry into the acceptability of storing and publishing that kind of "tracking" information within the Wikimedia movement might also be required).
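
To illustrate how several of the listed statements can be derived from the string alone, here is a minimal sketch. The function names are made up for this example, and code points are used as a rough stand-in for glyphs (an assumption: combining marks and multi-code-point signs would need proper grapheme segmentation).

```python
def lengths(text):
    """Lengths in code points, UTF-8 bytes and bits.

    Code points only approximate glyphs: a base letter followed by a
    combining mark counts as two.
    """
    n_bytes = len(text.encode("utf-8"))
    return {"code_points": len(text), "bytes": n_bytes, "bits": n_bytes * 8}


def anagram_key(text):
    """Two strings are anagrams of each other iff their keys are equal."""
    return "".join(sorted(text))


def substrings(text):
    """All non-empty contiguous substrings (quadratic in the string length)."""
    return {
        text[i:j]
        for i in range(len(text))
        for j in range(i + 1, len(text) + 1)
    }


def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion and substitution."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


query = "rsautiescébotbé"
print(lengths(query))
print(anagram_key(query))
print(levenshtein("kitten", "sitting"))
```

Finding anagrams or nearby strings that are attached to a definition would still need an index of known forms; only the keys and distances themselves come from the string.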

And rather than "notability", we should have the frequency of appearance in various corpora. That is, one corpus, one frequency. For example, we might have a number for each localized Wikimedia project (and then easily obtain the sum for each project and the total occurrence within a project). Of course, we should also integrate, as much as we can, numbers from other corpora. We should be careful that tokenisation (and possibly lemmatisation) uses a congruent methodology. We might for example integrate data from Google Ngram, which is licensed under CC-by-3.0 (but not in a Wikidata namespace that requires CC-0, obviously).
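
The "one corpus, one frequency" idea can be sketched as follows, with a single shared tokeniser so that counts stay comparable across corpora. The corpus names and texts are invented for the example, and the regex tokeniser is a deliberately naive assumption (no lemmatisation, no language-specific rules).

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase, then split into runs of word characters.

    The same function is applied to every corpus so that the
    resulting frequencies follow one methodology.
    """
    return re.findall(r"\w+", text.lower())


def frequencies(corpora):
    """Map each corpus name to a Counter of its tokens."""
    return {name: Counter(tokenize(text)) for name, text in corpora.items()}


def occurrences(term, per_corpus):
    """Return (count per corpus, total over all corpora) for one term."""
    counts = {name: counter[term] for name, counter in per_corpus.items()}
    return counts, sum(counts.values())


# Hypothetical miniature corpora standing in for real project dumps.
corpora = {
    "frwiki": "Le chat dort. Le chien aussi.",
    "frwiktionary": "le chat",
}
per_corpus = frequencies(corpora)
print(occurrences("le", per_corpus))
```

A term that never occurs simply gets a count of 0 in each corpus, so any string can be given a frequency statement without being stored in advance.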

Also, for "meaning" notability, we might have a number provided by contributor feedback. That is, once all this is integrated with Wiktionaries, people should be able to express "that was the definition I was looking for" (with a thumbs up or whatever a UX designer might come up with), and if they do, we might try to engage them even more by inviting them to provide information about the source where they encountered the term.

I'm looking forward to your feedback, thank you. --Psychoslave (talk) 06:54, 25 August 2017 (UTC)

I would like to clarify that this approach wasn't inspired by the idea of sending random glyphs in queries, but is a generalisation of a more practical problem. In an agglutinative language such as German, one can indeed come up with arbitrarily long words (I guess that should ring a bell for @Christopher Johnson (WMDE):, @John Erling Blad (WMDE): and @Lydia Pintscher (WMDE):), and things become even more significant with polysynthetic languages. To my mind, we shouldn't leave people with no clue about glyph sequences which are valid constructions within some languages. But they shouldn't be left clueless about the frequency of use of such a construction either. Also, I didn't mention it before, but among the corpora used for providing frequencies, there should be some representatives of "oral usage". For example, Lexique 3 includes frequencies in video subtitles to estimate the oral usage frequency of the lemmas it covers. --Psychoslave (talk) 08:00, 25 August 2017 (UTC)

@Psychoslave: Wikibase does not support creating statements on the fly; this would be a fundamental change to Wikibase. I suppose that calculated properties based on existing properties may eventually be supported (probably also calculated forms), but I don't think we can have any data outside of existing lexemes. Remember we also need to support SPARQL.--GZWDer (talk) 05:39, 31 August 2017 (UTC)

Please add examples of attestation

Where and how should attestation be added? Examples of references and their usage would help a lot. -- JakobVoss (talk) 15:21, 1 June 2018 (UTC)

Authors of peer-reviewed papers

Hello, what is the policy for authors of peer-reviewed papers? In some cases, they are added in a "short form" (which is usually not really "short"); otherwise new elements are created. I think that, following the relevance criteria for any "well-attested" author, it would be better to include any author of major peer-reviewed papers. What do you think? Cheers, --Marco Ciaramella (talk) 08:00, 21 October 2019 (UTC)

@Marco Ciaramella: this page is about the notability of lexicographical items - i.e. words in a specific language such as мир/міръ (L100000). You probably want the general notability page Wikidata talk:Notability. ArthurPSmith (talk) 17:31, 21 October 2019 (UTC)
@ArthurPSmith: Oops, sorry! Yes, that link probably suits me. Best, --Marco Ciaramella (talk) 18:16, 21 October 2019 (UTC)

Proper names

  • 4, part of another proper name such as a family name: Would this include any name from which a patronymic has been formed (in, say, Russian or any of the Scandinavian languages)? Anders + son => Andersson (Swedish), Finnboga + dottir => Finnbogadottir (Icelandic), Ivan + -ovich => Ivanovich (Russian)? Would they be notable in any language besides their original ones?
  • 5, inflections: Does the genitive count (Eric => Eric's, Erik => Eriks)? I would assume not, but it is better to spell out the rules, as there may be assumptions saying otherwise.

What about given names that may have different forms in different languages (Charles/Charlie/Carolus/Karl, Ioannis/John/Johan/Hans, Andreas/Andrew/Andreij/Anders, Francis/Frank/Frans, Jacob/James/Jakob, Maria/Mary, Margaret/Margit/Margareta/Greta)? Would each such group link to the same item for this sense (P5137)?--SM5POR (talk) 08:33, 10 January 2023 (UTC)
