Wikidata:Requests for permissions/Bot/Pi bot 17
The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
- Approved --Lymantria (talk) 17:01, 9 January 2021 (UTC)
Pi bot 17
Pi bot
Operator: Mike Peel
Task/s: Copy labels for humans from other languages to English
Code: Available on BitBucket
Function details: The code runs a query for instance of (P31)=human (Q5) items that do not have an English label but do have one in a similar language (currently defined as 'de,fr,es,pt,nl,it,sv,pl' - with fallback in that order), and copies the label to English. It excludes names that aren't in ASCII (this excludes letters like ł, but also é, and could do with improvement in the future - suggestions for how to better code this in Python would be appreciated!).
This was originally suggested by @GZWDer: at Project Chat (permalink). @BrokenSegue, PKM, Animalparty, Jura1: also commented there and may want to follow up here. Input from:
Dereckson
Harmonia Amanda
Hsarrazin
Jura
Чаховіч Уладзіслаў
Joxemai
Place Clichy
Branthecan
Azertus
Jon Harald Søby
PKM
Pmt
Sight Contamination
MaksOttoVonStirlitz
BeatrixBelibaste
Moebeus
Dcflyer
Looniverse
Aya Reyad
Infovarius
Tris T7
Klaas 'Z4us' van B. V
Deborahjay
Bruno Biondi
ZI Jony
Laddo
Da Dapper Don
Data Gamer
Luca favorido
The Sir of Data Analytics
Skim
E4024
Joeykentin
Envlh
Susanna Giaccai
Epìdosis
Aluxosm
Dnshitobu
Ruky Wunpini
Balû
★Trekker
Example edits: [1], [2], [3], [4]
Thanks. Mike Peel (talk) 19:35, 21 December 2020 (UTC)
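The selection step described under "Function details" can be sketched roughly as follows. This is a minimal illustration, not the bot's actual code: `pick_english_label` and the plain `labels` dict are hypothetical names, and moving on to the next language when a candidate fails the ASCII check is an assumption (the real bot may skip the item instead).

```python
# Source languages in fallback order, as listed in the request.
FALLBACK_LANGS = ["de", "fr", "es", "pt", "nl", "it", "sv", "pl"]


def pick_english_label(labels):
    """Return a label to copy to 'en', or None if nothing qualifies.

    `labels` is a hypothetical plain dict mapping language code -> label text.
    """
    if "en" in labels:
        return None  # an English label already exists; nothing to do
    for lang in FALLBACK_LANGS:
        candidate = labels.get(lang)
        # str.isascii() (Python 3.7+) rejects both 'ł' and 'é', which is the
        # limitation the request mentions.
        if candidate and candidate.isascii():
            return candidate
    return None
```

For instance, `pick_english_label({"nl": "Jan Jansen", "de": "Hans Jansen"})` picks the German label because 'de' comes first in the fallback order.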
- Maybe you'd want to select based on P27 values, e.g. countries de,fr,es,pt,nl,it,sv,pl? That way, the likelihood of transferring Cyrillic transcriptions from one language to another is reduced. Obviously, this would delay labels for items that don't have a P27 statement. --- Jura 19:48, 21 December 2020 (UTC)
- @Jura1: human (Q5) items don't tend to have country of citizenship (P27) values... Thanks. Mike Peel (talk) 19:53, 21 December 2020 (UTC)
- Two of the samples have them (I mean wdt:P27 ?country ). Two others could work with country of birth ( wdt:P19/wdt:P17 ?country ). --- Jura 20:06, 21 December 2020 (UTC)
- @Jura1: I've coded up exclusions for when country of citizenship (P27)=Russia (Q159) or Soviet Union (Q15180), and also following through from place of birth (P19) has country of citizenship (P27) values. Exclusions are far easier than whitelists, particularly since P27 or P19 values won't be present for most items that this would edit. If you can give me a list of QIDs to exclude, then I can add them to the code. Thanks. Mike Peel (talk) 20:41, 21 December 2020 (UTC)
- I tend to do the opposite (whitelist). Maybe this could work (no warranty). --- Jura 22:02, 21 December 2020 (UTC)
- I've added a few from there, but there seem to be a lot of countries on the list that I would expect this script to work OK for (e.g., Ireland, India). Thanks. Mike Peel (talk) 19:37, 22 December 2020 (UTC)
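The exclusion approach discussed in this thread might look something like the sketch below. The two QIDs are the ones named above; `is_excluded` and its arguments are hypothetical, and extracting the P27 values and the countries reached via P19 from an item is assumed to happen elsewhere.

```python
# QIDs named in the discussion: Russia (Q159) and the Soviet Union (Q15180).
EXCLUDED_COUNTRIES = {"Q159", "Q15180"}


def is_excluded(citizenship_qids, birthplace_country_qids=()):
    """True if any known country for the person is on the exclusion list.

    Both arguments are hypothetical, pre-extracted collections of QID strings:
    the P27 values, and the countries reached via the place of birth (P19).
    """
    known = set(citizenship_qids) | set(birthplace_country_qids)
    return not known.isdisjoint(EXCLUDED_COUNTRIES)
```

An item with no country data at all is not excluded here, which is the practical difference from a whitelist: a whitelist would skip every item that lacks P27 or P19.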
- (ec) You need to ensure that the name has Latin script originally, otherwise you mess this up as transcription is not necessarily identical for English and the language you want to copy the label from. Jura already mentioned this in the Project chat discussion. —MisterSynergy (talk) 19:49, 21 December 2020 (UTC)
- @MisterSynergy: That's why I added the ASCII check. Open to other suggestions for how to implement that or add additional checks? Thanks. Mike Peel (talk) 19:53, 21 December 2020 (UTC)
- The ASCII check does not help here at all. If you, for instance, have a German transcription of a Cyrillic name, it does not have any non-ASCII characters (almost always), but the German transcription is not suitable as an English transcription. In other words: a ru->de transcription is different from ru->en, thus you cannot copy en=de if the German label was transcribed from the Russian original as ru->de. Again per Jura above, I think that you need to limit this to people with citizenship in countries with (mainly) Latin script languages. —MisterSynergy (talk) 19:58, 21 December 2020 (UTC)
- @MisterSynergy: How about if I exclude any items with an ru label? Thanks. Mike Peel (talk) 20:02, 21 December 2020 (UTC)
- This is not limited to ru. It applies for sure for all Cyrillic scripts, and to a lesser degree to all other languages that need to be transcribed. —MisterSynergy (talk) 20:06, 21 December 2020 (UTC)
- @MisterSynergy: Do you have a list of other languages to exclude? I've modified the code so it takes a list of languages to exclude. Thanks. Mike Peel (talk) 20:13, 21 December 2020 (UTC)
- No, I don't; but I do know that there are plenty of non-Latin script languages. It might be easier to whitelist Latin script languages, minus the ones which are categorically unsuitable as a source for English labels (such as hu, lv, and some others). —MisterSynergy (talk) 20:22, 21 December 2020 (UTC)
- @MisterSynergy: OK, so you're suggesting excluding items if they include a label in any language other than 'de,fr,es,pt,nl,it,sv,pl'? Thanks. Mike Peel (talk) 20:29, 21 December 2020 (UTC)
- IMO the best would still be to rely on P27 values. Otherwise you can use a longer list of Latin-script labels than just 'de,fr,es,pt,nl,it,sv,pl'. For the nameGuzzler script, users usually use a set of around 80 Latin script language codes that would work. —MisterSynergy (talk) 20:57, 21 December 2020 (UTC)
- @MisterSynergy: I still don't want to rely on a property that won't be present in most cases that the bot would edit, which would particularly be the case for newer items. If you can provide a longer list of good/bad language names, I can add them to the code. Thanks. Mike Peel (talk) 21:02, 21 December 2020 (UTC)
- You are taking quite some risk here if you copy labels about persons we don't know much about. Anyways, I use User:MisterSynergy/nameGuzzlerOption.js for nameGuzzler, but I did copy this selection from others and do not guarantee that all are actually fine there :-) —MisterSynergy (talk) 21:10, 21 December 2020 (UTC)
- If the bot gets some wrong, then it's easy enough for people to change them. Worst-case scenario is it's only a transcription issue (the name would still be valid in other languages, it wouldn't cause BLP issues). I've adopted the list from nameguzzler, with some reordering/tweaks. Thanks. Mike Peel (talk) 19:37, 22 December 2020 (UTC)
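One way to read the whitelist idea raised in this thread is to skip any item that carries a label in a language outside an allowed set. The sketch below assumes that reading; `has_only_whitelisted_labels` is a hypothetical helper, and `LATIN_SCRIPT_LANGS` is an illustrative subset only, not the roughly 80-code nameGuzzler list mentioned above.

```python
# Illustrative subset only; the nameGuzzler list referred to above is much longer.
LATIN_SCRIPT_LANGS = {"en", "de", "fr", "es", "pt", "nl", "it", "sv"}


def has_only_whitelisted_labels(labels, allowed=LATIN_SCRIPT_LANGS):
    """True if every language code that has a label is in the allowed set.

    `labels` is a hypothetical dict mapping language code -> label text.
    """
    return set(labels) <= set(allowed)
```

With this check, an item that has both a 'de' and a 'ru' label is skipped, so a German transcription of a Russian name is not copied to English. It does not help when the original-language label is missing from the item.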
- I’m not sure about using pl, only because pl speakers are always on about NOT using en, es, etc, labels in pl because they always phonetically re-spell non-Polish names. Otherwise, I think this is fabulous. - PKM (talk) 20:51, 21 December 2020 (UTC)
- @PKM: OK, I've removed pl from the whitelist. Thanks. Mike Peel (talk) 20:57, 21 December 2020 (UTC)
- Support - PKM (talk) 21:02, 21 December 2020 (UTC)
- generally sgtm BrokenSegue (talk) 21:30, 21 December 2020 (UTC)
- @Mike Peel: the function at https://github.com/multichill/toollabs/blob/master/bot/wikidata/ulan_alias_import.py#L98 might help you. It filters for Latin1 and Latin2. Just check that no labels exist that are not latin1/latin2 and you should be good on the transliteration front (except of course the case where the label in the original language is missing). Multichill (talk) 09:27, 22 December 2020 (UTC)
- @Multichill: Thanks, that seems to work nicely. Thanks. Mike Peel (talk) 19:47, 22 December 2020 (UTC)
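A Latin-1/Latin-2 filter of the kind suggested here could be expressed with Python's standard codecs, as in the sketch below. This does not reproduce the linked function, which may work differently; `is_latin1_or_latin2` and `all_labels_latin` are hypothetical names, and treating "Latin1/Latin2" as ISO 8859-1 and ISO 8859-2 is an assumption.

```python
def is_latin1_or_latin2(text):
    """True if the whole string can be encoded as ISO 8859-1 or ISO 8859-2."""
    for codec in ("latin_1", "iso8859_2"):
        try:
            text.encode(codec)
            return True
        except UnicodeEncodeError:
            continue  # try the next codec
    return False


def all_labels_latin(labels):
    """True if no label on the item falls outside Latin-1/Latin-2.

    `labels` is a hypothetical dict mapping language code -> label text.
    """
    return all(is_latin1_or_latin2(value) for value in labels.values())
```

Unlike the plain ASCII check, this accepts both 'é' (Latin-1) and 'ł' (Latin-2), while an item with any Cyrillic label is rejected as a whole.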
- An alternative approach could be to add the (non-English) labels as English aliases when it's not clear if they should be English labels. --- Jura 09:34, 22 December 2020 (UTC)
- @Mike Peel: Why copy them to "en" and not to "mul"? If the name is the same in multiple languages, copying it to English and a bunch of other languages wastes valuable database space. ChristianKl ❪✉❫ 15:36, 22 December 2020 (UTC)
- @ChristianKl: Because 'en' is at the end of the fall-back chain, so if it can't be found in another language then 'en' will be used. However, if 'en' is not set then it might fall back to the QID. Thanks. Mike Peel (talk) 15:39, 22 December 2020 (UTC)
- That's not true, mul is at the end of the fall-back chain after en. ChristianKl ❪✉❫ 15:43, 22 December 2020 (UTC)
- @ChristianKl: OK, I'm not understanding something then, can you explain? I don't see 'mul' anywhere? I'm also not actually sure where the language fallback list is, all I can find is Special:MyLanguageFallbackChain, which is just for individual users not for wikis... Thanks. Mike Peel (talk) 15:57, 22 December 2020 (UTC)
- @Mike Peel: I'm sorry, it seems I overinterpreted mul being added on https://www.wikidata.org/wiki/Help:Monolingual_text_languages as mul being generally available. ChristianKl ❪✉❫ 19:59, 22 December 2020 (UTC)
- It seems it's a fairly trivial development step to make this generally available. --- Jura 06:41, 23 December 2020 (UTC)
- To select items and labels, how about queries like this? It currently gives 42038 items, all but 144 with labels. --- Jura 06:41, 23 December 2020 (UTC)
- I still prefer to avoid depending on country of citizenship (P27) if possible. Thanks. Mike Peel (talk) 20:19, 24 December 2020 (UTC)
- @Mike Peel: Is it possible to skip over any with a number in them? I've seen occasional examples in the past of nobility having labels copied over like this, which can have the odd effect of English labels with eg "4e Comte de X" or "2. Earl of Y". Andrew Gray (talk) 21:54, 23 December 2020 (UTC)
- @Andrew Gray: This seems sensible, I've implemented a check to avoid numbers (within the 'isEnglish' function). Thanks. Mike Peel (talk) 20:19, 24 December 2020 (UTC)
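The number check requested in the last exchange can be as small as the sketch below. `contains_digit` is a hypothetical helper; how the bot's own 'isEnglish' function implements the check is not shown in this discussion.

```python
def contains_digit(label):
    """True if any character in the label is a digit, e.g. '4e Comte de X'."""
    return any(ch.isdigit() for ch in label)
```

A label is then skipped when `contains_digit(label)` is true. Note that this only catches digit characters, so Roman-numeral ordinals such as "Louis XIV" would still pass.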
Names in Portuguese
@Lymantria: It seems that pt and pt-br biography names are identical and can be copied between them (i.e., if there is a name in pt, add it to pt-br). Is it OK if I implement that as an extension to this task, or should I submit a separate bot request for that, please? Thanks. Mike Peel (talk) 08:27, 8 March 2021 (UTC)
- @Mike Peel: Strictly formally, this request does not apply to pt/pt-br (or en/simple). I think it is better to pose a new request. Lymantria (talk) 08:56, 8 March 2021 (UTC)
- @Lymantria: Thanks, I followed this up at Wikidata:Requests for permissions/Bot/Pi bot 20. Thanks. Mike Peel (talk) 20:15, 24 March 2021 (UTC)