Wikidata:Requests for permissions/Bot/GZWDer (flood) 5

GZWDer (flood) 5

GZWDer (flood)
Operator: GZWDer

Task/s: Mass import of lexemes from various (reliable) sources.

Code: Not available for now

Function details: For a source in a specific language:

  1. Export the word list (with part of speech, and forms if possible)
  2. Use SPARQL to find any existing lexemes that may be duplicates, and remove the matching entries from the word list.
  3. Create lexemes for the entries that do not yet exist.
  4. If the source is in the public domain, also add senses. (For existing lexemes, senses may be created if no senses exist. Lexemes with existing senses will be skipped.) If the source is copyrighted, only the words will be imported.

--GZWDer (talk) 17:29, 19 January 2019 (UTC)
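
Editorial sketch (not the bot's code, which is not available): steps 2 and 3 above could look roughly like the following. The function names are hypothetical, the query uses the lexeme predicates (`dct:language`, `wikibase:lemma`, `wikibase:lexicalCategory`) as they are understood to be exposed by the Wikidata Query Service, and the duplicate test (same lemma after Unicode NFC normalization plus same lexical category) is an assumption rather than the operator's stated rule.

```python
import unicodedata


def build_existing_lexemes_query(language_qid: str) -> str:
    """Build a SPARQL query listing lemma and lexical category for one language.

    Assumption: the caller supplies the language item ID and runs the query
    against the Wikidata Query Service.
    """
    return (
        "SELECT ?lexeme ?lemma ?category WHERE {\n"
        f"  ?lexeme dct:language wd:{language_qid} ;\n"
        "          wikibase:lemma ?lemma ;\n"
        "          wikibase:lexicalCategory ?category .\n"
        "}"
    )


def normalize(lemma: str) -> str:
    """Trim whitespace and apply NFC; case is kept because it can be meaningful."""
    return unicodedata.normalize("NFC", lemma.strip())


def filter_new_entries(entries, existing):
    """Return source entries whose (lemma, category) pair is not yet present.

    `entries` and `existing` are iterables of (lemma, category) pairs.
    Repeated pairs inside the source list are also dropped.
    """
    existing_keys = {(normalize(lemma), cat) for lemma, cat in existing}
    seen = set()
    new_entries = []
    for lemma, category in entries:
        key = (normalize(lemma), category)
        if key in existing_keys or key in seen:
            continue
        seen.add(key)
        new_entries.append(key)
    return new_entries
```

Matching on lemma and category alone would not catch spelling variants or alternate representations, so a real import would likely need an additional review step for near-matches.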


Discussion

  •   Oppose I think every source should be discussed separately. KaMan (talk) 17:42, 19 January 2019 (UTC)
  •   Comment A. Are we allowed to mass-import lexemes? I didn't think that was permitted yet. B. Wikidata's Lexeme structure is different from what most sources would have - I think we REALLY need to see a good set of sample edits (and perhaps the source code too) before starting on this. Definitely do not allow this without samples of the bot's work. And each separate source should be requested as a separate task (and samples provided before approval). ArthurPSmith (talk) 19:41, 19 January 2019 (UTC)
    • As we currently don't have consensus for mass-importing lexemes, I'm filing a request to obtain one. Also, it is easy not to import duplicates as long as we check the existing list of lemmas.--GZWDer (talk) 20:14, 19 January 2019 (UTC)
      • Ok, specify an example source and let's see at least 10 proposed examples either implemented or with enough detail that we can tell what you are doing. Lexemes are more than just words so just importing a word list is NOT what we want here. How do you determine lexical category? How do you generate forms? How do you check for alternate representations (spelling variants)? These are important details! ArthurPSmith (talk) 16:11, 1 February 2019 (UTC)
      • One example of a potential source that has data structured in a reasonably similar fashion might be WordNet (for English). The words it includes could be considered lemmas for lexemes, as they deliberately remove all inflected forms. However that means it could not itself be a source for those forms. It groups words into synonym sets so that senses could be generated I think somewhat automatically from them. So this would be an interesting collection to pursue via automation. But there are still a lot of details that would need to be examined to make sure we were doing something sensible with the automated import. ArthurPSmith (talk) 14:48, 5 February 2019 (UTC)
  •   Oppose per Arthur in full. Mahir256 (talk) 21:53, 4 February 2019 (UTC)
  • I have created some example lexemes. Comments are welcome, but more significant work will not start until July. In the future, probably millions of lexemes will be imported (as many sources are copyrighted, we may expect lexemes without senses, since I will not import the definitions).--GZWDer (talk) 23:54, 14 February 2019 (UTC)
    • @GZWDer: Thanks! So what you've done here looks reasonable to me. I'm certainly not familiar with Welsh, but it looks like you are importing Welsh verb infinitives as lexemes from an out-of-copyright Welsh-English dictionary, adding the English senses as sense S1 on each one. No forms, or secondary senses. So language and lexical category are clear, and it appears this source wouldn't have multiple forms for the same lexeme in different places, so we shouldn't need to worry about duplication there. The one thing that might be nice to add would be at least one form (presumably this dictionary is using a standard form for the verbs?). Also if you are planning to add other lexemes from this source it would be nice to see how you would handle other lexical categories - is "Adar, n. p. birds, fowls" the plural form of "Adain, n. a wing; a bird" or are they different lexemes? It also looks like this source doesn't include any proper nouns so we don't have to worry about capitalization issues. Anyway, in general I'd   Support this particular case, but I think there's still a bit more to work out with it. ArthurPSmith (talk) 15:28, 15 February 2019 (UTC)
      • Further comment here - I'm not sure if you've imported Lexeme:L42622 correctly - the source seems to use a semicolon character (';') to indicate separate senses, so I think that should be 2 senses, not 1, in this case. See the next entry - 'Absenwr, n. m. backbiter; absentee; slanderer' where 'absentee' is clearly a distinct meaning. Many more examples further down, such as 'Ach, n. f. a fluid liquid; a stem' which are even clearer on this. Also it would be nice if the source could be directly referenced as the source of the gloss on the sense. Not sure we have a mechanism to do that right now. ArthurPSmith (talk) 15:47, 15 February 2019 (UTC)
      • The workflow of import may be improved (senses will be split); for now, the 100 entries I have imported may be manually fixed. --GZWDer (talk) 12:26, 16 February 2019 (UTC)
        • So do you plan to correct them manually? KaMan (talk) 12:43, 16 February 2019 (UTC)
        • Corrected.--GZWDer (talk) 13:06, 16 February 2019 (UTC)
    • I think it's time to move ahead with bot-created items. I'm not really convinced the project has progressed much in recent months as far as lexemes are concerned. The above can give a much-needed fresh, productive contribution. @Llywelyn2000: what do you think of the newly created Welsh language lexemes? --- Jura 16:41, 15 February 2019 (UTC)
  • Hesitating between   Support and   Wait.   Comment Interesting, but examples like adgyffroi (L42717) need some work to reach what I think is the minimal level. Lexemes should always have at least one form (the main lemma). described by source (P1343) is very good, but could we have the page(s) (P304) too? It might be even better to have several values in described by source (P1343), which would tackle all the copyright and reliability problems. PS: a native speaker review is a sine qua non (especially as some languages have changed a lot; if you were to import the Lexique étymologique du breton moderne (Q19216625), which I'm working on at Wikisource right now with the plan to import it into Lexemes one day ;), a lot of lemmas would need to be rectified before import). Regards, VIGNERON (talk) 16:43, 15 February 2019 (UTC)
  • Note that I also plan to import copyrighted sources, but only the words themselves, not any definitions, so we will have many lexemes without senses. In any case, any further action will be after July. By the way, many sources I found do not have part-of-speech information, so we may want to set up something like mix'n'match to handle them (this would also be useful for online resources like Wiktionary where entries are not fully reliable).--GZWDer (talk) 12:26, 16 February 2019 (UTC)
    • That's why I think every import source should be discussed separately, not in one request for permission. KaMan (talk) 12:43, 16 February 2019 (UTC)
  •   Oppose I agree with KaMan, each import source should be discussed separately. Please open new requests for permissions for each source. Pamputt (talk) 10:22, 17 February 2019 (UTC)
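
Editorial sketch: the sense-splitting point raised above (the Welsh-English dictionary separating distinct senses with a semicolon) can be illustrated as follows. The helper name `split_glosses` is hypothetical, and treating every semicolon as a sense boundary is an assumption drawn from the examples quoted in the discussion; it is not necessarily how the corrected import works.

```python
def split_glosses(definition: str) -> list[str]:
    """Split a dictionary definition into one gloss per sense.

    Assumption: the source uses ';' only to separate distinct senses.
    Empty fragments (for example from a trailing ';') are discarded.
    """
    return [part.strip() for part in definition.split(";") if part.strip()]


# Definitions quoted in the discussion above.
print(split_glosses("backbiter; absentee; slanderer"))
print(split_glosses("a fluid liquid; a stem"))
```

Commas are left alone here because, in the quoted entries, they appear to separate near-synonyms within a single sense ("birds, fowls"); a native speaker would still need to confirm that reading per source.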