Wikidata:Requests for permissions/Bot/AmpersandBot 2

The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.

Not done @PinkAmpersand: This request seems to be abandoned, please reopen it if that is not the case. Thanks. Mike Peel (talk) 20:12, 21 July 2020 (UTC)

AmpersandBot

AmpersandBot (talk • contribs • new items • new lexemes • SUL • Block log • User rights log • User rights • xtools)
Operator: PinkAmpersand (talk • contribs • logs)

Task/s: Generate descriptions for village items in the format of "village in <place>, <place>, <country>"

Code: https://github.com/PinkAmpersand/AmpersandBot/blob/master/village.py

Function details: With my first approved task (approved in July 2016, but not completed until recently), I set descriptions for about 20,000 Ukrainian villages based on their country (P17), instance of (P31), and located in the administrative territorial entity (P131) values. Now, I would like to use the latter two values to generalize this script to—ominous music—every village in the world!

The script works as follows:

It pulls up 5,000 items backlinking to village (Q532)
It checks whether an item is instance of (P31) village (Q532)
It then labels items as follows:
1. It removes disambiguation from labels in any language:
  1. It runs a RegEx search for ,| \(
  2. It removes those characters and any following ones
  3. It sets the old label as an alias for the given language
  4. If the alias contains non-ASCII characters, it creates an ASCII version and sets that as an alias as well
  5. It compiles a new list of labels and aliases for the relevant languages, and updates the item with all of them at once
2. It sets labels in all Latin-script languages:
  1. It checks if the current Latin-script languages all use the same label.
  2. If they don't, it does nothing except log the item for further review.
  3. If they do, it sets that label as the label for all other Latin-script languages, using a list of 196 (viewable in the source code)
  4. If the label contains non-ASCII characters, it also sets an ASCII version of the label as an alias
  5. It compiles a new list of labels and aliases for the relevant languages, and updates the item with all of them at once
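The labeling steps above can be sketched roughly as follows. This is a minimal standalone illustration, not the code in village.py: the function names, the NFKD-based ASCII folding, and the plain {language: label} dictionaries are assumptions made for the example. Only the pattern `,| \(` is taken from the description above.

```python
import re
import unicodedata

# The pattern quoted in step 1.1: the first comma or " (" starts the disambiguator.
DISAMBIGUATION = re.compile(r",| \(")


def strip_disambiguation(label):
    """Cut the label at the first ',' or ' (' (unchanged if neither occurs)."""
    return DISAMBIGUATION.split(label, maxsplit=1)[0]


def ascii_variant(text):
    """Return an ASCII-only version of text, or None if text is already ASCII.

    Assumption: diacritics are dropped via NFKD decomposition; characters
    with no ASCII base letter are simply discarded.
    """
    if text.isascii():
        return None
    decomposed = unicodedata.normalize("NFKD", text)
    folded = decomposed.encode("ascii", "ignore").decode("ascii")
    return folded or None


def clean_labels(labels):
    """Given {lang: label}, return (new_labels, new_aliases) for changed languages."""
    new_labels, new_aliases = {}, {}
    for lang, label in labels.items():
        short = strip_disambiguation(label)
        if not short or short == label:
            continue  # nothing to strip, or stripping would leave an empty label
        new_labels[lang] = short
        aliases = [label]  # the old label is kept as an alias
        folded = ascii_variant(label)
        if folded:
            aliases.append(folded)
        new_aliases[lang] = aliases
    return new_labels, new_aliases


def fill_missing_latin_labels(labels, latin_langs):
    """Return {lang: label} for Latin-script languages that lack a label.

    Existing labels are never overwritten, and nothing is proposed unless
    all existing Latin-script labels agree.
    """
    present = {labels[lang] for lang in latin_langs if lang in labels}
    if len(present) != 1:
        return {}  # disagreement (or no label at all): leave for manual review
    shared = present.pop()
    return {lang: shared for lang in latin_langs if lang not in labels}
```

Returning the proposed labels and aliases as dictionaries, rather than writing them one by one, mirrors the "updates the item with all of them at once" step.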
And describes items as follows:
1. It checks whether the item either a) lacks an English description or b) has an English description that merely says "village in <country>" or "village in <region>". (I've manually coded into the RegEx the names of every multi-word country. This still leaves a blind spot for multi-word entities other than countries. I welcome advice on how to fix this.)
2. If so, it gets the item's parent entity. If that entity is a country, it describes the item as "village in <parent>"
3. If the parent entity is not a country, it checks the grandparent entity. If that is a country, it describes the item as "village in <parent>, <grandparent>"
4. Next onto the great-grandparent entity. "village in <parent>, <grandparent>, <great-grandparent>"
5. For the great-great-grandparent entity, only the top three levels are used: "village in <grandparent>, <great-grandparent>, <great-great-grandparent>". This is slightly more likely to result in dupe errors, but the code handles those.
6. Ditto the thrice-great-grandparent entity.
7. If even the thrice-great-grandparent is not a country, the item is logged for further review. If people think I should go deeper, I am willing to; I may do so of my own initiative if the test run turns up too many of these errors.
After 5,000 items have been processed, another 5,000 are pulled. The script continues until there are no backlinks left to describe.
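The description steps can be illustrated with a small sketch. It is written under stated assumptions and is not the logic of village.py: `parent_of`, `countries`, and `label_of` are hypothetical in-memory lookups standing in for P131, P17/P31 checks, and English labels. The comma-based test in `needs_description` is a swapped-in alternative to the hand-coded list of multi-word countries, offered only as one possible approach.

```python
import re

# Parent through thrice-great-grandparent, as in steps 2 to 7.
MAX_DEPTH = 5

# Assumption: a description with no comma names a single entity, however
# many words that entity's name has.
BARE_DESCRIPTION = re.compile(r"village in [^,]+")


def needs_description(description):
    """True if the description is missing or names only one enclosing entity."""
    if not description:
        return True
    return BARE_DESCRIPTION.fullmatch(description) is not None


def ancestry_to_country(item, parent_of, countries):
    """Walk up the parent chain until a country is reached.

    Returns the list of ancestors ending in the country, or None if no
    country is found within MAX_DEPTH levels (the item should be logged).
    """
    chain = []
    current = item
    for _ in range(MAX_DEPTH):
        current = parent_of.get(current)
        if current is None:
            return None
        chain.append(current)
        if current in countries:
            return chain
    return None


def describe_village(item, parent_of, countries, label_of):
    """Build 'village in <...>' from at most the top three levels of the chain."""
    chain = ancestry_to_country(item, parent_of, countries)
    if chain is None:
        return None
    names = [label_of.get(entity) for entity in chain[-3:]]
    if any(name is None for name in names):
        return None  # an ancestor lacks a label: leave for manual review
    return "village in " + ", ".join(names)
```

Bounding the walk with `MAX_DEPTH` also keeps the loop from running forever if the parent data happens to contain a cycle.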

Does this sound good? — PinkAmpers&^{(Je vous invite à me parler)} 01:43, 22 February 2018 (UTC) ^{Updated 22:17, 3 March 2018 (UTC)}

Test run here. The only issue that arose was some items, like Koro-ni-O (Q25694), being listed in my command line as updated, but not actually updating. It's a bug, and I'll look into it, but its only effect is to limit the bot's potential, not to introduce any unwanted behavior. — PinkAmpers&^{(Je vous invite à me parler)} 02:16, 22 February 2018 (UTC)

I will approve the bot in a couple of days provided no objections have been raised.--Ymblanter (talk) 08:39, 25 February 2018 (UTC)

Cool, thanks! But actually, I'm working on a few more things for the bot to do to these village items while it's "in the neighborhood", so would you mind holding off until I can post a second test run? — PinkAmpers&^{(Je vous invite à me parler)} 00:23, 26 February 2018 (UTC)

This is fine, no problem.--Ymblanter (talk) 10:42, 26 February 2018 (UTC)

@Ymblanter: Okay. I'm all done. I've updated the bot's description above. Diff of changes here. New test run here. There was one glitch in this test run, namely that the bot failed to add ASCII aliases for non-ASCII labels while performing the Latin-script label unanimity function. This was due to a stray space before the word aliases in line 247. I fixed that here, and ran a test edit here to check that it worked. But I'm happy to run a few dozen more test edits if you want to see that fix working in action. — PinkAmpers&^{(Je vous invite à me parler)} 22:17, 3 March 2018 (UTC)

Concerning the Latin-script languages, not all of them use the same spelling. For example, here I am sure that in lv it is not Utvin (most likely Utvins), in lt it is not Utvin, and possibly in some other languages it is not Utvin either (for example, crh uses phonetic spelling; Utvin may be fine, but other names will not be). I would suggest restricting this part of the task to major languages (say German, French, Spanish, Portuguese, Italian, Danish, Swedish, maybe a couple more) and doing some research for the others (I have no idea, for example, what Navajo uses). The rest seems to be fine.--Ymblanter (talk) 07:48, 4 March 2018 (UTC)

I'm concerned about exonyms too. Even if a language uses the same name variant as other Latin-script languages for most settlements, there may still be particular settlements for which it does not. 90.191.81.65 14:30, 4 March 2018 (UTC)

I considered that, 90.191.81.65, but IMHO it's not a problem. The script will never overwrite an earlier label, and indeed won't change the labels unless all existing Latin-script labels are in agreement. So the worst-case scenario here is that an item would go from having no label in one language to having one that is imperfect but not incorrect. An endonym will always be a valid alias, after all. — PinkAmpers&^{(Je vous invite à me parler)} 21:37, 4 March 2018 (UTC)

I'm not sure that all languages consider an endonym a valid alias if there's an exonym too. And even if it is technically not incorrect, in some cases an endonym would still be rather odd. My concern on this is similar to one currently brought up in project chat. 90.191.81.65 07:58, 5 March 2018 (UTC)

I would think that an endonym is by definition a valid alias. The bar for "valid alias" is pretty low, after all. So if there isn't consensus to use endonyms as labels, I can set them as aliases instead. — PinkAmpers&^{(Je vous invite à me parler)} 17:51, 5 March 2018 (UTC)

Also, all romanized names are probably problematic. Many languages may use the same romanization system (the same as in English or the one recommended by the UN) for a particular foreign language, but there are also languages which have their own romanization system. So a couple of the current Latin-script languages using the same romanization would be merely a coincidence. 90.191.81.65 14:49, 4 March 2018 (UTC)

I'm confused about your concern here. The only romanization that the script does is in setting aliases, not labels. — PinkAmpers&^{(Je vous invite à me parler)} 21:37, 4 March 2018 (UTC)

All Ukrainian, Georgian, Arabic etc. place names apart from exonyms are romanized in Latin-script languages. And there are different romanization systems, some of which are specific to a particular language, e.g. Ukrainian-Estonian transcription. For instance, currently all four Latin labels for Burhunka (Q4099444) happen to be "Burhunka", but that wouldn't be correct in Estonian. 90.191.81.65 07:58, 5 March 2018 (UTC)

Well, that's part of why I'm using a smaller set of languages now. Can you give me examples of languages within the set that have this same problem? — PinkAmpers&^{(Je vous invite à me parler)} 17:51, 5 March 2018 (UTC)

Thanks for the feedback, Ymblanter. I've pared back the list, and posted at project chat asking for help with re-expanding it. See Wikidata:Project chat § Help needed with l10n for bot. — PinkAmpers&^{(Je vous invite à me parler)} 21:37, 4 March 2018 (UTC)

I note that here the bot picks up the name of a former territorial entity, even though preferred rank is set for the current parish. Also, is the whole territorial hierarchy really necessary in the description if there's no need to disambiguate from other villages with the same name in the same country? For a small country like Estonia I'd prefer simpler descriptions. 90.191.81.65 14:30, 4 March 2018 (UTC)
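The rank issue raised here could be handled along the following lines. This is a hedged sketch and not part of village.py: it assumes the P131 statements are available as plain dictionaries in the shape of Wikidata's entity JSON (a "rank" key and a "mainsnak" with a "datavalue"), and the function name is hypothetical.

```python
def best_rank_values(claims):
    """Return item IDs from the best-ranked statements.

    Preferred-rank statements win if any exist; otherwise normal-rank
    statements are used. Deprecated statements are always ignored.
    """
    usable = [c for c in claims if c.get("rank") != "deprecated"]
    preferred = [c for c in usable if c.get("rank") == "preferred"]
    chosen = preferred or usable
    ids = []
    for claim in chosen:
        # "novalue" and "somevalue" snaks carry no datavalue, so skip them.
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value", {})
        if "id" in value:
            ids.append(value["id"])
    return ids
```

With a filter like this, a former entity left at normal rank would be skipped whenever the current parish is marked preferred.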

The format I'm using is standard for English-language descriptions. See Help:Description § Go from more specific to less specific. — PinkAmpers&^{(Je vous invite à me parler)} 21:37, 4 March 2018 (UTC)

The section you refer to concerns the order in which a description goes from more to less specific. As for how specific you should be, it leaves that open, apart from saying in the section above it that adding one subregion of a country is common, and giving two examples where the whole administrative hierarchy is not shown. 90.191.81.65 07:58, 5 March 2018 (UTC)

To me, the takeaway from Help:Description is that using a second-level subregion is not required, but also not discouraged. It comes down to an individual editor's choice. — PinkAmpers&^{(Je vous invite à me parler)} 17:51, 5 March 2018 (UTC)

Comment I'm somewhat concerned about the absence of a plan to maintain this going forward. If descriptions in 200 languages for hundreds of thousands of items are being added, this becomes virtually impossible to correct manually. Descriptions may need to be updated if the names change, or if the P131 is found to be incorrect or irrelevant. Already now, default labels for items that may seem static (e.g. categories/lists) aren't maintained once they are added; this would just add another chunk of redundant data that isn't maintained. The field already suffers from the lack of maintenance of cebwiki imports, so please don't add more to it. Maybe one would want to focus on English descriptions and native label statements instead.
--- Jura 10:16, 12 March 2018 (UTC)

The above discussion is preserved as an archive. Please do not modify it. Subsequent comments should be made in a new section.