Wikidata:Requests for permissions/Bot/NirmosBot 2

NirmosBot 2

NirmosBot (talk | contribs | new items | new lexemes | SUL | Block log | User rights log | User rights | xtools)
Operator: Nirmos (talk | contribs | logs)

Task/s: Turn the first letter in Swedish labels to lowercase where the English label starts with lowercase.

Code: sv:User:NirmosBot/TurnWikidataLabelToLowercase.js

Function details: The code iterates over all pages on the local project where it is run, which would be Swedish Wikipedia for me initially, although I'd be open to helping other projects with identical capitalization rules (possibly nn/nb/da) in the future if it works well and if they want my help. If the page is connected to Wikidata it asks for the labels. If the Wikidata object has labels in both English and the local language (wgContentLanguage), and the English label is different from the local label, and if the English label starts with lowercase, it sets the local label to start with lowercase. Nirmos (talk) 23:09, 15 April 2021 (UTC)
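The decision rule described above can be sketched roughly as follows. This is a minimal illustration and not the bot's actual code: the function names `lcfirst`, `startsWithLowercase` and `proposeLocalLabel` are hypothetical, and the labels are assumed to be passed in as a plain object mapping language codes to strings.

```javascript
// Hypothetical helpers; the real script may be structured differently.

// Lowercase only the first character of a string.
function lcfirst(s) {
  return s.charAt(0).toLowerCase() + s.slice(1);
}

// True when the first character is a cased letter in its lowercase form.
function startsWithLowercase(s) {
  const first = s.charAt(0);
  return first !== "" && first === first.toLowerCase() && first !== first.toUpperCase();
}

// labels: assumed shape { en: "cat", sv: "Katt" }.
// Returns the proposed new local label, or null when nothing should change.
function proposeLocalLabel(labels, localLang) {
  const en = labels.en;
  const local = labels[localLang];
  if (en === undefined || local === undefined) return null; // both labels are required
  if (en === local) return null;                            // labels must differ
  if (!startsWithLowercase(en)) return null;                // English must start with lowercase
  const candidate = lcfirst(local);
  return candidate === local ? null : candidate;            // already lowercase: no edit
}
```

For example, `proposeLocalLabel({ en: "cat", sv: "Katt" }, "sv")` proposes `"katt"`, while an item whose English label starts with a capital letter is left alone.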

The initial run did not work as well as I had hoped. Two problems:
  1. There is different word order in Swedish and English. This caused the incorrect edits Special:Diff/1402246408 and Special:Diff/1402250309. This can be solved by requiring that the English label does not contain any capital letters at all.
  2. The Swedish label can be an acronym, or at least all caps. This caused Special:Diff/1402248851. This can be solved by requiring that only the first letter in the Swedish label is a capital letter.
I have temporarily removed the transclusion of this page from Wikidata:Requests for permissions/Bot. I will re-add it when I have fixed the above issues. Nirmos (talk) 00:34, 16 April 2021 (UTC)
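The two stricter requirements proposed above could look something like the sketch below. The function names are hypothetical and the example labels are made up for illustration; they are not the labels from the diffs mentioned above.

```javascript
// True when the string contains no uppercase letters at all
// (intended for the English label).
function hasNoUppercase(s) {
  return !/\p{Lu}/u.test(s);
}

// True when the first character is an uppercase letter and no later
// character is (intended for the local label; rejects acronyms and all caps).
function onlyFirstLetterUppercase(s) {
  const [first, ...rest] = Array.from(s);
  return first !== undefined && /^\p{Lu}$/u.test(first) && hasNoUppercase(rest.join(""));
}
```

With these checks, a hypothetical English label such as "battle of Hastings" fails `hasNoUppercase`, and a hypothetical Swedish label such as "NATO" fails `onlyFirstLetterUppercase`, so neither item would be edited.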
Unfortunately, this is not gonna work. There are cases like probability axioms (Q974605) where the labels are completely different and only one of them should be capitalized. I don't see a way of solving that. Nirmos (talk) 03:43, 16 April 2021 (UTC)

Larske: Attempting this a third time. This time, the code asserts that the page text (that you can read and select with a computer mouse, i.e. neither the wikitext nor the HTML) contains lcfirst( pagename ). As with the two previous trial runs, this is throttled to 30 secs per edit, and a total number of 250 edits, so there is no need to panic. Nirmos (talk) 11:23, 17 April 2021 (UTC)

Nirmos: I guess that you should keep the check of the English label as well as the new check of the Swedish page text. Here are some examples from svwp where I think the Swedish labels should not be changed despite the presence of the "pagename word" without capitalization in the visible page text.
In some cases the page text is faulty, so if the Swedish label is changed based on this, faults get propagated into Wikidata. Examples:
Another thing is that maybe you should include "word boundaries" in the "text inclusion check", so that the pagename does not match just part of a longer word.
  • sv:Tilapia ("tilapia" is included in the longer word "Niltilapia" in the page text, but that doesn't mean that "Tilapia" should be changed to "tilapia")
This can, however, cause some cases to be missed:
  • sv:Mitos (only "mitosen", not "mitos", is present as a complete word in the page text)
--Larske (talk) 13:23, 17 April 2021 (UTC)
Larske: I haven't removed any check, only added. The script would not change "Neopets" to "neopets", or "Uppsalaskolan" to "uppsalaskolan". Nirmos (talk) 13:28, 17 April 2021 (UTC)
It would not change "Miskarp" to "miskarp" either. Nirmos (talk) 13:31, 17 April 2021 (UTC)

Larske: In the general case, I agree that when checking whether a word is written with initial lowercase, it should make sure that that letter is not in the middle of a word. However, in this specific case of tilapia (Q47793), it's a common name for a group of animals that is not a clade, like "paddor" or "nattfjärilar". The article also contains "tilapia" where it's not in the middle of a word. As such, I have now changed the script to require word boundary before lcfirst( pagename ). This means that:

  1. The script would change "Tilapia" to "tilapia"
  2. The script would change "Mitos" to "mitos"
  3. The script would not change "Kvass" to "kvass"
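A check of this kind, requiring a word boundary only before lcfirst( pagename ), might be sketched as below. This is an assumption about how it could be done, not the script's actual implementation: the names `escapeRegExp` and `textContainsLowercasedName` are hypothetical, and the page texts in the usage note are invented. JavaScript's `\b` is defined in terms of ASCII word characters, so the sketch uses a Unicode-aware lookbehind instead, so that Swedish letters such as å, ä and ö count as part of a word.

```javascript
// Lowercase only the first character of a string.
function lcfirst(s) {
  return s.charAt(0).toLowerCase() + s.slice(1);
}

// Escape regular-expression syntax characters in a literal string.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// True when the visible page text contains lcfirst(pageName) and the match
// is not directly preceded by a letter. Nothing is required after the match,
// so inflected forms such as "mitosen" still count.
function textContainsLowercasedName(pageText, pageName) {
  const pattern = new RegExp("(?<!\\p{L})" + escapeRegExp(lcfirst(pageName)), "u");
  return pattern.test(pageText);
}
```

With invented page texts, `textContainsLowercasedName("Under mitosen delas cellkärnan.", "Mitos")` is true, while `textContainsLowercasedName("Niltilapia är en fisk.", "Tilapia")` is false because the only occurrence of "tilapia" is in the middle of a longer word.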

Starting the fourth trial run now. Nirmos (talk) 14:39, 17 April 2021 (UTC)

Nirmos: OK, my bad, I mixed up tilapia (Q47793) with the taxon Tilapia (Q1770703). Good that you also kept the comparison with English labels to be on the safe side, even if you will thereby miss objects that have a faulty Swedish label but no English label.
I was interested in finding out how many objects need a correction of the Swedish label, so I have investigated a random sample of 50,000 svwp articles. You can find the result here in the svwp sandbox. My guess is that approximately 1 percent of the articles/objects need an update due to faulty capitalization of the Swedish label.
--Larske (talk) 17:55, 17 April 2021 (UTC)

250 edits now done using the latest code. Nirmos (talk) 20:58, 17 April 2021 (UTC)

Nirmos: In this batch I only have some concern about dugong (Q129544). If it really is a taxon, shouldn't it be Dugong? Maybe the script could avoid changing the labels of objects that are instance of (P31) taxon (Q16521). In many articles about taxa, like sv:Pauxi, there are lists of species with names like "Pauxi pauxi", and these would trigger a change of label from "Pauxi" to "pauxi", which would be incorrect if the English label is not correctly capitalized. I haven't found any object of this type where the English label is like "pauxi", but please read on.
The capitalization of the English labels for objects that are instance of (P31) taxon (Q16521) doesn't look very consistent to me. Here is a sample of objects with their English and Swedish labels where the English labels differ from their respective taxon name (P225). I guess it may have to do with "trivial names", but if these are not handled the same way across language versions, I guess it is too difficult for a script to sort this out and find the correct capitalization.
The Swedish labels are also in a mess, but I am not sure it would be an improvement to change...
  • "Hamstrar" to "hamstrar" just because of the English label "cricetinae"
  • "Smultronsläktet" to "smultronsläktet" just because of the English label "strawberries"
  • "Spottspindlar" to "spottspindlar" just because of the English label "spitting spider"
  • "Ekorrar" to "ekorrar" just because of the English label "squirrel"
The same goes for many of the Swedish labels given as a trivial name in plural. But the script doesn't know about plural, does it?
On the other hand, some Swedish labels that are the same as "words in a dictionary", which are normally not capitalized, should maybe be changed:
  • "Struts" could maybe be changed to "struts", although the English label is "Common Ostrich"
  • "Impala" could maybe be changed to "impala", although the English label is "Impala"
  • "Husbock" could maybe be changed to "husbock", although the English label is "Old-house borer"
but to be on the safe side, maybe the best would be to have the script avoid changing labels for taxa.
--Larske (talk) 06:09, 18 April 2021 (UTC)
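The suggestion to skip objects that are instance of (P31) taxon (Q16521) could be sketched as follows. This is only an illustration of the idea, with hypothetical function names; it assumes the entity is available as a JavaScript object in the JSON shape that Wikibase returns for an item, where statements are grouped by property under `claims`.

```javascript
const TAXON = "Q16521"; // taxon

// True when the entity has a P31 (instance of) statement whose value is targetId.
// Statements with "somevalue" or "novalue" snaks carry no datavalue and are ignored.
function isInstanceOf(entity, targetId) {
  const statements = (entity.claims && entity.claims.P31) || [];
  return statements.some((statement) => {
    const snak = statement.mainsnak;
    return Boolean(
      snak &&
      snak.snaktype === "value" &&
      snak.datavalue &&
      snak.datavalue.value &&
      snak.datavalue.value.id === targetId
    );
  });
}

// Hypothetical guard: leave the labels of taxon items untouched.
function shouldSkipEntity(entity) {
  return isInstanceOf(entity, TAXON);
}
```

Such a guard would run before any label comparison, so taxon items are never edited regardless of what the English label or the article text says.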
Larske: "dugong" with lowercase "d" looks correct. It's a common name like "katt". And remember, it's not solely changing label based on English label – the svwiki article also needs to contain lcfirst( pagename ) with word boundary in front of it. sv:Hamstrar does not do that, so the script would not change it to "hamstrar" (to be clear, it should be "hamstrar" with lowercase "h", because it's a common name, but because that capitalization is not present as its own word in the article, the script cannot be confident enough to change it). Nirmos (talk) 07:42, 18 April 2021 (UTC)
Yes, of course, having two criteria that have to agree reduces the risk of faulty changes. But my point is that even if we look at both the English label and the Swedish article, there is a risk that they happen to agree on a wrong capitalization.
The border between a "taxon name" and a "common name" is not crystal clear to me, so I might be wrong here too, but is cricetinae really a "common name" in English? And shouldn't Spottspindlar be treated as a name in Swedish and thus be capitalized when it comes to the label? Then there is the further complication that there is also a Swedish compound word, although not mentioned in SAOB, SAOL or SO, that is spelt "spottspindel" in singular and "spottspindlar" in plural.
Another comment I have is about page names including disambiguations, like sv:Fil (verktyg) file (Q193142). The script could ignore the "(verktyg)" part when it looks in the Swedish article and just look for fil (that will be found) instead of looking for fil (verktyg) (that will not be found).
--Larske (talk) 13:04, 18 April 2021 (UTC)
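Ignoring the disambiguation part of a page name before searching the article text could be done along these lines. The function name `stripDisambiguation` is hypothetical, and the sketch assumes that a disambiguator is always a single trailing parenthetical preceded by whitespace.

```javascript
// Remove one trailing parenthetical such as " (verktyg)" from a page name.
// Parentheses elsewhere in the name are left alone.
function stripDisambiguation(pageName) {
  return pageName.replace(/\s+\([^()]*\)$/, "");
}
```

With this, `stripDisambiguation("Fil (verktyg)")` gives `"Fil"`, so the text check would look for "fil" rather than "fil (verktyg)". Names without a trailing parenthetical, such as "Mitos", pass through unchanged.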