Wikidata:Requests for permissions/Bot/MsynABot
The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
- Approved --Lymantria (talk) 09:39, 19 October 2020 (UTC)[reply]
MsynABot
MsynABot (talk • contribs • new items • new lexemes • SUL • Block log • User rights log • User rights • xtools)
Operator: MisterSynergy (talk • contribs • logs)
Task/s: implement the RfC “semi-protection to prevent vandalism on most used Items”:
- protect “highly used items” (item in use on more than 500 Wikimedia pages)
- unprotect those which are no longer “highly used items” in that sense
Code: just one script/file for pywikibot; User:MsynABot/rfc-protect contains a working copy that should remain available to the community even if I disappear for some reason and all external services are no longer available either; the latest version will later be accessible in my BitBucket account under an MIT license, and on Toolforge in the corresponding msynabot tool account (all Toolforge users will have reading rights for the script)
Function details:
- The script retrieves a list of items with high usage from the WDCM dashboard, i.e. the input is being compiled by Wikimedia Deutschland. It adds an indefinite semi-protection where missing (per RfC: in use on 500+ Wikimedia pages), and lifts indefinite semi-protection when no longer needed in that sense.
- The input data by WMDE is a cumulative count of item uses based on the wbc entity usage tables of Wikibase clients (i.e. all Wikimedia wikis). Details can be found at Wikidata:Wikidata Concepts Monitor and wikitech:Wikidata Concepts Monitor, and the WDCM code can be found here. A known limitation is that any use within the “Structured Data at Commons” project (SDC) is not yet considered in any usage count statistics and thus ignored completely.
- A supervised initial run will add around 23,000 page protections and lift around 50 that were implemented by User:Abián before the RfC had started.
- Once the initial run is completed, the script will run regularly from Toolforge (once a week, same as the update frequency of WMDE’s input list) and unattended. Based on the numbers in recent weeks, I expect that around 200 new protections are necessary per week, and very few items (<5) need to have their protection lifted as they are not “highly used” any longer. I will monitor the activity of the bot account closely, of course.
- As a safety measure, the script will not add any protections if more than 1000 new protections are required per run; it will also not lift any protection if more than 10 protections are to be removed per run. This should ensure that a hiccup with the WMDE-generated input does not make the bot wreak havoc. These safety limits are subject to adjustment in the future, as I think they should be roughly three times larger than the average weekly workload.
- In order not to overwrite any pre-existing protections, the bot only adds protection when the item is currently completely unprotected. If any form of protection is there, it needs to expire first and will then be re-added in the next run.
- In order not to remove protections that have not been added due to “high use”, the bot only removes protections that are exactly “indefinite semi-protections”, and either implemented by User:MsynABot itself, or on a hardcoded whitelist that contains roughly 3000 entries (initial item protections by User:Abián in 2018/2019, and a couple of protections by User:MisterSynergy that were made while developing this script).
- Test actions using the script (with my regular account): see the page protection log, or Special:Diff/1284753131 (add “indefinite semi-protection”), and Special:Diff/1284752854 (remove “indefinite semi-protection”).
- The bot will run at 1 action/5 seconds, and it will respect maxlag (with value 5 sec).
- There is also a request for admin rights for this bot account at Wikidata:Requests for permissions/Administrator/MsynABot.
—MisterSynergy (talk) 18:45, 10 October 2020 (UTC)[reply]
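The decision rules and safety limits from the function details above can be sketched roughly as follows. This is a minimal illustration and not the actual script: all names (`Protection`, `should_protect`, `may_unprotect`, `apply_safety_limits`) are hypothetical, and the values `"autoconfirmed"` and `"infinity"` are assumed representations of "semi-protection" and "indefinite".

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

# Limits quoted in the request; the constant names are hypothetical.
MAX_NEW_PROTECTIONS_PER_RUN = 1000
MAX_LIFTED_PROTECTIONS_PER_RUN = 10
TRUSTED_PROTECTORS = {"MsynABot"}


@dataclass(frozen=True)
class Protection:
    level: str   # assumed: "autoconfirmed" stands for semi-protection
    expiry: str  # assumed: "infinity" stands for indefinite protection
    set_by: str  # user name of the protecting account
    log_id: int  # id of the protection log entry


def should_protect(highly_used: bool, current: Optional[Protection]) -> bool:
    """Protect only highly used items that are completely unprotected."""
    return highly_used and current is None


def may_unprotect(highly_used: bool, current: Optional[Protection],
                  whitelisted_log_ids: Set[int]) -> bool:
    """Lift only indefinite semi-protections set by the bot or whitelisted."""
    if highly_used or current is None:
        return False
    if current.level != "autoconfirmed" or current.expiry != "infinity":
        return False
    return (current.set_by in TRUSTED_PROTECTORS
            or current.log_id in whitelisted_log_ids)


def apply_safety_limits(to_protect: List, to_unprotect: List) -> Tuple[List, List]:
    """Drop a whole batch when it exceeds its per-run limit."""
    if len(to_protect) > MAX_NEW_PROTECTIONS_PER_RUN:
        to_protect = []
    if len(to_unprotect) > MAX_LIFTED_PROTECTIONS_PER_RUN:
        to_unprotect = []
    return to_protect, to_unprotect
```

In this sketch the two limits are checked independently, so an oversized list of new protections does not block a small list of removals, which matches the wording of the request.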
- Discussion
- What's the process for contributor's wanting to edit these pages? The admin who previously protected such items mostly abandoned that aspect since. --- Jura 19:36, 10 October 2020 (UTC)[reply]
- The protection comment currently reads: “Highly used item: to be indefinitely semi-protected per Wikidata:Page protection policy#Highly used items; use Template:Edit request on the item talk page if you cannot edit this item.” So, {{Edit request}} on the item talk page is the approach. I am open to alternatives. —MisterSynergy (talk) 19:39, 10 October 2020 (UTC)[reply]
- See phab:T229100. --abián 20:19, 10 October 2020 (UTC)[reply]
- What should happen when a Wikipedia changes its approach to item uses? A few wikis tend to use a large number of items for trivial things (good thing btw). If this is adopted by more, the number of uses can substantially increase with a somewhat limited risk. For, e.g., Wikiproject movies, I'd now rather discourage Wikipedias from making use of Wikidata cast lists, just to avoid that all actors end up being protected. Some users opting for 500 explicitly stated that this would only concern a limited percentage (0.029%). Is that ratio somehow checked? 23000 additional ones would still be within that ratio --- Jura 19:36, 10 October 2020 (UTC)[reply]
- Yes, the number is likely going to increase continuously; in fact, hopefully it is going to increase, as this would mean that more Wikidata is being used in Wikimedia projects. When the RfC was created on 5 Feb 2019, the initiator User:Abián claimed that there are ~15800 items used more than 500 times. Now, 87 weeks later, there are 26351 items fulfilling this condition (3009 already properly protected). In other words: ~10500 more in total, or ~120 more on average per week.
When the number rises a lot within a week (as described above), the bot will not add any protections and I need to have a look at what’s going on. However, as the scheme in Wikidata:Page protection policy#Highly used items has been found by consensus in an RfC, I think there should be another RfC to modify it in case we deem it necessary. I would of course pause the bot if someone wants to change the page protection policy accordingly. —MisterSynergy (talk) 19:52, 10 October 2020 (UTC)[reply]
- You might want to re-read what the supporters of 500 had in mind. I think the sample you gave above, Rita Tojeiro (Q58247035), already highlights the problem. She is co-author #271 of some publication that is frequently referenced on a single wiki. --- Jura 19:56, 10 October 2020 (UTC)[reply]
- Sure, I share your concerns; particularly since apparently all the articles citing that publication were created using a script in arzwiki (see arz:Special:EntityUsage/Q58247035). Nevertheless, the RfC which resulted in this scheme was (unfortunately) not designed to consider fine details like this one and the outcome is pretty rigid without much room for interpretation. If you have an idea for another RfC to refine the current page protection policy, we can easily wait for it to finish. I myself am not a keen supporter of this idea anyways and voted against item protections in the past RfC, and I have not really changed my mind since. —MisterSynergy (talk) 20:03, 10 October 2020 (UTC)[reply]
- I think we could semi-protect those above 500 as long as it doesn't exceed the 0.029% of items people had in mind. This avoids Wikidata becoming yet another frozen wiki .. --- Jura 06:12, 11 October 2020 (UTC)[reply]
- The RfC explicitly resulted in the hard 500 uses per item limit, not a relative one (question 3A won, not 3B or 3C). Besides this, I think a relative number is not very practical, as we’d either need to stop adding new protections once the limit was reached, or shift protections from one item to another, leaving “highly used items” with more and more uses unprotected.
I can offer to create a report page which receives an update each time the bot finishes a run. It can easily contain these numbers: timestamp, # of items eligible for protection (500+ uses), # of items actually protected under this scheme (may be fewer, as some may already be protected differently with full-protection or temporary protection), # of added and removed protections in that run, total ratio of items falling into this scheme. If more information is of interest, I can see whether I can retrieve it as well. This would help us to monitor the trend and we could start another RfC once we think that things get out of hand. —MisterSynergy (talk) 10:37, 11 October 2020 (UTC)[reply]
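The per-run report described above could be modelled as a small record like the following. This is only a sketch: the class and field names are hypothetical, the ratio is assumed to mean eligible items divided by all items, and the figures in the usage note are illustrative rather than real counts.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class RunReport:
    timestamp: datetime
    eligible_items: int   # items in use on 500+ Wikimedia pages
    protected_items: int  # items actually protected under this scheme
    added: int            # protections added in this run
    removed: int          # protections lifted in this run
    total_items: int      # all items in the wiki (must be supplied by the caller)

    @property
    def eligible_ratio_percent(self) -> float:
        """Share of all items that fall into the scheme, in percent."""
        return 100.0 * self.eligible_items / self.total_items

    def as_row(self) -> str:
        """One line of the report page, fields separated by ' | '."""
        return " | ".join([
            self.timestamp.strftime("%Y-%m-%d %H:%M"),
            str(self.eligible_items),
            str(self.protected_items),
            f"+{self.added}",
            f"-{self.removed}",
            f"{self.eligible_ratio_percent:.3f}%",
        ])
```

For example, `RunReport(datetime(2020, 10, 19, tzinfo=timezone.utc), 290, 280, 12, 1, 1_000_000).as_row()` produces one line whose last field is `0.029%`; a ratio drifting well above the figure that RfC participants had in mind would be a signal to revisit the policy.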
- Support; thanks for your work, MisterSynergy. --abián 20:19, 10 October 2020 (UTC)[reply]
- Support; it would be good to finally get this implemented. Andrew Gray (talk) 20:36, 10 October 2020 (UTC)[reply]
- Support Nice! Some remarks on the script:
- What is the bot going to do if an admin lifts protection done by the bot?
- Sometimes entity usage could decrease temporarily or fluctuate around 500. Did you consider something like a "cooldown" period, i.e. waiting some period of time (skip one run) before lifting protection? Or moving the threshold to lift protection slightly below 500 (e.g. 490 or 495).
- You may want to make an exception for e.g. sandbox items, tour items, etc. --Matěj Suchánek (talk) 09:09, 12 October 2020 (UTC)[reply]
- Right now it would add a new indef semi-protection in the next run in case the item is in use on 500 or more Wikimedia pages (or do nothing otherwise). Should we include a blacklist of items that should be exempt from this scheme, so that no protection is applied even when the item is used on more than 500 Wikimedia pages? If so, which criteria should apply? The RfC did not address this issue explicitly, and the page protection policy does not either.
- I have considered that, but will not implement it as long as nobody requests it. The RfC does not include such a cooldown period, but based on common sense I think one could have one. A more useful cooldown limit would be 300 pages or so. User:Abián has added around 3000 page protections before the RfC had started based on a similar evaluation as the one I am using here. The protections have not been systematically reviewed yet, but there are only 63 items with fewer than 500 uses, and only 13 of them have more than 300 uses. A cooldown limit of 300 would mean that these remain protected indefinitely, unless they fall below this limit.
- Yes, will do so. Thanks!
- —MisterSynergy (talk) 09:24, 12 October 2020 (UTC)[reply]
- While I expect no admin to attempt to undo the bot's actions, there is still the possibility, so I wanted to know if this was considered. I think as long as there is consensus to have all those items protected, we can have the bot do its job.
- My point was really the case of fluctuating usage (unit changes around 500 causing repeated (un)protections). But it was again a made-up hypothetical scenario which I don't know we should ever expect (possible causes could be edit warring or sandbox experiments).
- --Matěj Suchánek (talk) 09:48, 12 October 2020 (UTC)[reply]
- On the fluctuations: if there are ~500 Wikimedia pages using an item, fluctuations of 5 or 10 pages using the item are pretty common. I would thus aim for a clearly lower cooldown limit in case this is requested.
Anyways, the bot runs only once a week, based on an evaluation made by WMDE once a week, so there will usually not be more than one protection modification per week. —MisterSynergy (talk) 09:59, 12 October 2020 (UTC)[reply]
- Support Thanks for taking this on! ArthurPSmith (talk) 14:44, 12 October 2020 (UTC)[reply]
- Just as a sanity check, I took a sample from [1] (I'm assuming this is the list the bot will use). Should these really be semi-protected?
- member of the 17th Parliament of Great Britain (Q94911160) - I'm assuming this is used by the page of each member.
- fire station (Q1195942) - I'm not sure but perhaps a popular subject of Commons pictures?
- Don Quixote (Q480) - obviously well-known subject, although it's not immediately clear to me which pages would be using its data.
- arrondissement of Sarrebourg (Q702478) - no idea. (I know I could dig into the datasets to track, but I haven't done that yet.)
- I wonder if the targets should reach a minimum threshold of aggregated pageviews. Entities linked from pages with low pageviews in total would not exactly constitute a high visibility case, and have less reason to semi-protect. Having the pageview threshold would make the bot more robust against malicious attempts to create lots of spurious uses and mislead it. I'm not sure how high the threshold should be exactly, or would it be worth the effort to implement, though. whym (talk) 14:08, 13 October 2020 (UTC)[reply]
- Thank you for your comment.
- Yes, the linked csv file is indeed the input source for the bot.
- The current implementation is intentionally closely based on the RfC (i.e. community consensus) and the corresponding section in the page protection policy which was added based on it.
- I think we can change some fine details of the implementation, but we should not simply do something considerably different than what was agreed on. If anyone thinks that the current page protection policy and the section in question in particular needs an overhaul, it would now be the right time to start an RfC. I would of course wait with the bot until it is finished. If nobody wants to start an RfC now, I think we should proceed and maybe adjust the scheme later based on the experiences we are going to make.
- Regarding the listed cases: item use can be investigated by any user. Open an item page, go to "page information" in the left menu, look for the "Page properties" section and find a list of linked projects which use the item in the table under "Wikis subscribed to this entity". It links to Special:EntityUsage/Q… in the corresponding projects, which lists details about the usage in that project.
- —MisterSynergy (talk) 14:37, 13 October 2020 (UTC)[reply]
- Thank you for the response. To be clear, I didn't intend to suggest swapping the usage condition with the pageview condition. I intended to suggest having a focus within the larger target set of items, perhaps only initially. 20,000+ semi-protections in a short time period seems like a drastic move. (I believe that's likely to be larger than the total number of the indefinitely semi-protected items we currently have.) Sometimes a slow start with a small focus area is wiser than doing all at once. Another approach might be to start with semi-protecting at a higher threshold like 5000+ usages, while lifting it at below 500. Having some gap like that, even a small one like 600/500 instead of 5000/500, will help prevent fluctuations, too. whym (talk) 14:29, 14 October 2020 (UTC)[reply]
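The gap between a "protect" and a "lift" threshold discussed in this thread is a form of hysteresis. A minimal sketch, assuming the 500/300 pair mentioned earlier as defaults (600/500 would work the same way); the function name and parameters are hypothetical and not part of the bot:

```python
def next_protection_state(currently_protected: bool, usage: int,
                          protect_at: int = 500, lift_below: int = 300) -> bool:
    """Return whether the item should be protected after this run.

    Items at or above `protect_at` get protected, items below `lift_below`
    get unprotected, and anything in between keeps its current state.
    """
    if lift_below > protect_at:
        raise ValueError("lift_below must not exceed protect_at")
    if usage >= protect_at:
        return True
    if usage < lift_below:
        return False
    return currently_protected


# Usage fluctuating around 500 no longer toggles the protection each run.
state = False
history = []
for weekly_usage in [505, 495, 502, 290, 400, 500]:
    state = next_protection_state(state, weekly_usage)
    history.append(state)
```

With a single threshold of 500, the first three weekly values would cause a protection, a removal and another protection; with the gap, the item simply stays protected until usage drops clearly below the lower limit.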
- Yes, it would be an option to ramp this up in some steps over a couple of weeks. We would also be able to get used to the edit requests that may show up in the future. Some impression about items above certain usage numbers:
| used on more than … Wikimedia pages | number of items |
|---|---|
| 1,000,000 | 47 |
| 500,000 | 86 |
| 200,000 | 253 |
| 100,000 | 457 |
| 50,000 | 805 |
| 20,000 | 1622 |
| 10,000 | 2458 |
| 5000 | 3950 |
| 2000 | 7517 |
| 1000 | 13274 |
| 500 | 26351 |
- I think we could easily start at 5000+ uses (<4000 page protections) and then approach 500 uses over a couple of weeks. Currently we have around 3250 indefinitely semi-protected items, of which around 3000 fall already under this "highly used item" scheme. We have indeed not used page protections in this wiki on a similar scale before. That said, based on another RfC, all ~8000 property pages are now also indefinitely semi-protected. —MisterSynergy (talk) 17:47, 14 October 2020 (UTC)[reply]
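To get a feeling for the workload of such a stepwise start, the cumulative counts from the table can be turned into per-stage increments. This is only a sketch: the intermediate stages (2000 and 1000) are an assumed schedule, the function name is hypothetical, and the roughly 3000 items that are already protected are ignored, so the real number of new protections per stage would be lower.

```python
from typing import Dict, Iterable, List, Tuple

# Cumulative counts taken from the table above (items used on more than N pages).
ITEMS_ABOVE: Dict[int, int] = {5000: 3950, 2000: 7517, 1000: 13274, 500: 26351}


def stage_increments(items_above: Dict[int, int],
                     stages: Iterable[int]) -> List[Tuple[int, int]]:
    """Return (threshold, additional items) for each stage, highest first."""
    previous = 0
    result = []
    for threshold in sorted(stages, reverse=True):
        count = items_above[threshold]
        result.append((threshold, count - previous))
        previous = count
    return result


increments = stage_increments(ITEMS_ABOVE, [5000, 2000, 1000, 500])
# → [(5000, 3950), (2000, 3567), (1000, 5757), (500, 13077)]
```

The last step from 1000 to 500 uses accounts for about half of all eligible items, so it might be worth splitting it further if the volume of edit requests is a concern.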
- Can you make a similar table with the number of wikis that the items are used in? Steak (talk) 07:04, 15 October 2020 (UTC)[reply]
- I don't think such numbers are available in any of the files in this folder. —MisterSynergy (talk) 08:35, 15 October 2020 (UTC)[reply]
- It is available in the MediaWiki API. Once you have the IDs of the target items, you could query the online API to filter out those with a low number of subscribing wikis, if you want. whym (talk) 12:34, 16 October 2020 (UTC)[reply]
- @MisterSynergy: I'm very impressed by the whitelist system, which has dispelled the reservations about this bot task that I commented on in the RfC. However, I also see that the whitelist is hardcoded, and there is no blacklist (manually opt an item out of further MsynABot protection / unprotection). Would it be possible to include the following:
- Make the username whitelist an admin-editable wikipage
- Make the item whitelist an admin-editable wikipage
- Add functionality for an item blacklist, with the list also being an admin-editable wikipage? Deryck Chan (talk) 17:16, 13 October 2020 (UTC)[reply]
- You mention so many black and whitelists… Hope I get your comment correctly.
- Currently there are two "whitelists" for the unprotect function, in order to avoid removal of protections that have been added for another reason than "highly used item". Technically it is not simple to distinguish "highly used item" protections from any other protection. These whitelists help the bot to remove only those protections it may remove, and not any other.
- The one that you call "username whitelist" is closed, and does not need to be admin-editable for that reason; it contains "User:MsynABot" only. This means that the bot does usually not lift protections added by other admins than itself.
- However, there are exceptions, because User:Abián has added around 3000 protections earlier and those may be lifted by the bot. This is the list which you apparently call "item whitelist", but it is technically more a "protection event whitelist" (identified by log_id). This second whitelist could be expanded once some admin decided to add protections under this scheme as well, but this should rarely happen once the bot does the job and keeps the protections more or less always up-to-date. I thus don't think it needs to be an editable whitelist. However, it will be a readable text file on Toolforge in the tool account (not yet set up).
- The "blacklist" of items that should not be protected regardless of their usage is already promised to User:Matěj_Suchánek, but not yet implemented (will do so). It will at least contain sandbox items and Tour items. I think it is a good idea to make this list admin-editable, so it will reside in the bot's user namespace and I will link it from the task description later.
- Did I get this right? Or have I misunderstood something here? —MisterSynergy (talk) 18:02, 13 October 2020 (UTC)[reply]
- You understood my comment 100% correctly, thanks! In terms of why the whitelists should be admin editable, here are the possible use cases:
- MsynABot is kaputt for a few weeks. In the meantime, other admins have protected a bunch of items because they are highly used. When the MsynABot comes back to life, you'll want an updated whitelist including those pages.
- With any luck, there will be other protection bots in the future, either because you want to hand over the botmaker task to someone else, or because new functionality is required. The bots will need to interact with each other, so the user whitelist of your bot may need to be expanded to include other protection bots. Ideally you want other admins and botmakers to be allowed to configure that without asking you to change your source code. --Deryck Chan (talk) 21:19, 18 October 2020 (UTC)[reply]
- I think both situations require some sort of review by me anyways. Both whitelists should actually make sure that the bot does not lift any protections that it should not touch, and I'd rather err on the side of caution by keeping these whitelists closed. If another admin (or admin-bot) makes protections under this scheme, they can review them by themselves while I have not whitelisted them. I would expect to need some sort of review of this bot in such situations anyways. ---MisterSynergy (talk) 22:11, 18 October 2020 (UTC)[reply]
- I don't like the idea to block pages used above a certain threshold number. I would instead suggest to block pages that are used in more than X wikis, where X could be something like 10 or so. This would ensure that high usage in only one or two wikis does not cause a block here. Optionally a simple threshold number could be used, but only additionally and with a higher threshold (something like 2000 or so). Steak (talk) 07:41, 14 October 2020 (UTC)[reply]
- I try to implement as closely as possible what was found as community consensus in Wikidata:Requests for comment/semi-protection to prevent vandalism on most used Items more than a year ago, and went to Wikidata:Page protection policy#Highly used items subsequently. There is only very little personal flavor in the proposed bot. As I said earlier, we could easily wait for another RfC on the matter to finish in case someone wants to start one; but this should be happening rather soon. —MisterSynergy (talk) 08:32, 14 October 2020 (UTC)[reply]
- As I added several minutes ago above, I don't think targeting a subset of the original targets is in conflict with the RFC result. It just means we are having a slow start (to see if there are any unforeseeable consequences) within what the RFC dictates. whym (talk) 14:34, 14 October 2020 (UTC)[reply]
- Continue discussion above at the table. —MisterSynergy (talk) 17:47, 14 October 2020 (UTC)[reply]