Wikidata:Requests for permissions/Bot/GZWDer (flood) 2
- The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
- Superseded by the more general discussion at Wikidata:Requests for permissions/Bot/RegularBot 2.--10:17, 8 August 2020 (UTC)
GZWDer (flood) (talk • contribs • new items • new lexemes • SUL • Block log • User rights log • User rights • xtools)
Operator: GZWDer (talk • contribs • logs)
Task/s: Create new items and improve existing items from cebwiki and srwiki
Code: Run via various Pywikibot scripts (probably together with other tools)
Function details: The work includes several steps:
- Create items from w:ceb:Kategoriya:Articles without Wikidata item (plan to do together with step 2)
- Import GeoNames ID (P1566) for pages from w:ceb:Kategoriya:GeoNames ID not in Wikidata
- Import coordinate location (P625) for pages from w:ceb:Kategoriya:Coordinates not on Wikidata
- Add country (P17) for cebwiki items
- Add instance of (P31) for cebwiki items
- (probably) Add located in the administrative territorial entity (P131) for cebwiki items
- (probably) Add located in time zone (P421) for cebwiki items
- Add descriptions in Chinese and English for cebwiki items (only if steps 4 and 5 are completed)
For srwiki, the actions are similar.
--GZWDer (talk) 13:56, 16 July 2018 (UTC)[reply]
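The statement-import steps above would typically be run with Pywikibot's harvesting scripts. A hypothetical sketch of the guard such an import needs before each write: only add a harvested property when the target item does not already carry a value for it, so a re-run never duplicates statements. The property IDs are the ones from the step list; the data shapes are illustrative, not Pywikibot's actual API.

```python
def claims_to_add(existing, harvested):
    """Return only the harvested claims that are safe to add.

    existing:  property ID -> list of values already on the item
    harvested: property ID -> value parsed from the cebwiki article
    A claim is kept only when the item has no value for that property.
    """
    return {
        prop: value
        for prop, value in harvested.items()
        if not existing.get(prop)
    }


existing = {"P17": ["Q30"], "P625": []}    # country already set, no coordinates
harvested = {"P17": "Q30",                 # country: skip, already present
             "P625": (45.0, -111.0),       # coordinate location: add
             "P1566": "123456"}            # GeoNames ID (made-up value): add
print(claims_to_add(existing, harvested))
```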
- Note: until phab:T198396 is fixed, this can only be done step by step, not multiple tasks at a time.--GZWDer (talk) 14:02, 16 July 2018 (UTC)[reply]
- Support Thank you for your elaboration! Keeping to my word now. Mahir256 (talk) 13:59, 16 July 2018 (UTC)[reply]
- @Mahir256: Please unblock the bot account. I'm not going to import more statements from cebwiki (and srwiki) until the discussion is closed, and I have several other (low-speed) uses for the bot account.--GZWDer (talk) 14:01, 16 July 2018 (UTC)[reply]
- Yes, I did that, as I said I would do. Although @GZWDer: what will differ in your procedure with regard to the srwiki items? A lot of those places might have eswiki article equivalents (with the same INEGI code (Q5796667)); do you plan to link these if they exist? Mahir256 (talk) 14:02, 16 July 2018 (UTC)[reply]
- The harvest_template script cannot check for duplicates; duplicates can only be found after the data is imported (this may be a bug, though).--GZWDer (talk) 14:04, 16 July 2018 (UTC)[reply]
- @Pasleim: Would this functionality be easy to add to the tool? It certainly seems desirable, especially with regard to GeoNames IDs. Mahir256 (talk) 14:06, 16 July 2018 (UTC)[reply]
- See phab:T199698. I do not use Pasleim's harvest template tool because the tool stops automatically when it meets errors (it should retry the edit; if it hits a rate limit, retry after some time).--GZWDer (talk) 14:10, 16 July 2018 (UTC)[reply]
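The retry behaviour suggested here could look like the following minimal sketch, using exponential backoff. The names are illustrative and not part of any real tool; `edit` stands in for any save call that raises on failure (e.g. a rate-limit error).

```python
import time

def edit_with_retry(edit, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call edit(); on failure, retry with exponential backoff instead
    of stopping the whole run. Re-raises after the last attempt."""
    for attempt in range(max_retries):
        try:
            return edit()
        except Exception:
            if attempt == max_retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # wait longer after each failure


# Usage: a flaky edit that succeeds on the third try.
calls = {"n": 0}
def flaky_edit():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "saved"

print(edit_with_retry(flaky_edit))  # -> saved
```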
- Oppose cebwiki is, as many users have noted with concern, the black hole of wikis. This so-called "data" has too many mistakes. --Liuxinyu970226 (talk) 14:15, 16 July 2018 (UTC)[reply]
- Oppose Needs to do far more checking as to whether related items already exist, to add the information and sitelink to existing items if possible; and to appropriately relate the new item to existing items if not. If other items already have any matching identifiers (but are e.g. linked to a different ceb-wiki item), or there is any other reason to think it may be a duplicate, then any new item should be marked instance of (P31) Wikimedia duplicated page (Q17362920) as its only P31, and be linked to the existing item by said to be the same as (P460). Jheald (talk) 14:19, 16 July 2018 (UTC)[reply]
- Duplicates are easier to find after they are imported to Wikidata than on cebwiki.--GZWDer (talk) 14:24, 16 July 2018 (UTC)[reply]
- @Jheald: It may be worth our time (or worth the time of those who already make corrections on cebwiki) to go to GeoNames and correct things our(them)selves so that in the event Lsjbot returns it doesn't recreate these duplicates. Mahir256 (talk) 14:34, 16 July 2018 (UTC)[reply]
- @GZWDer: Here I give you an example of why it's also not easy to find duplicates via your bot's importing: Wikidata:Interwiki_conflicts/Unresolved/2017#Q28833033/Q14043515/Q13881518/Q13873847/Q13810561/Q13771124 (if this link doesn't work, use Ctrl+F to search for any of the mentioned item numbers). --Liuxinyu970226 (talk) 14:35, 16 July 2018 (UTC)[reply]
- @Liuxinyu970226: The issue is fixed.--GZWDer (talk) 15:16, 16 July 2018 (UTC)[reply]
- @GZWDer: I try bloody hard to avoid creating new items that are duplicates, going to considerable lengths with off-line scripts and augmenting existing data to avoid doing so; and doing my level best to clear up any that have slipped online, as quickly as I can. I don't see why I should expect less from anybody else. Jheald (talk) 14:45, 16 July 2018 (UTC)[reply]
- Comment Given the capacity problems of Wikidata and the fact that cebwiki is practically dormant, I don't think this should be done. Somehow I doubt the operator will do any of the announced maintenance, as I think they announced that a couple of months back and then left it to other Wikidata users. So no, not another 600,000 items. For the general discussion, see Wikidata:Project_chat#Another_cebwiki_flood?.
--- Jura 20:18, 16 July 2018 (UTC)[reply]
- cebwiki is not dormant, as the articles are still being maintained.--GZWDer (talk) 00:30, 17 July 2018 (UTC)[reply]
- Is there a way to see this on ceb:? I take it that any user on ceb:Special:Recent changes without a local user page isn't really active there.
--- Jura 04:41, 26 July 2018 (UTC)[reply]
- Oppose Per Jheald. Relying on the plan that it "is much easier to find such duplicates if the data is stored in a structured way", and thus deliberately importing duplicates (which won't be merged within a very short time), is an abuse of Wikidata and our resources. Resources spent cleaning up the mess from one source are then missing elsewhere, where they could bring high-quality data to other wikis. The duplicates are a big problem; they pop up in search results and queries etc. Sitelinks might be added after the data is cleaned off-Wikidata (whether cleaning is feasible at all, I have no idea; perhaps deletion of articles on cebwiki is a better solution than importing cebwiki sitelinks here). --Marsupium (talk) 23:26, 18 July 2018 (UTC)[reply]
- Duplicates already exist everywhere in Wikidata, so it should not be taken for granted that different items refer to different concepts (though that is usually the case), and nobody should use search or query results directly without care. Searches are not intended to be used directly by third-party consumers. For queries, if data consumers really think duplicates in Wikidata query results are an issue, they can choose to exclude cebwiki-only items from the results.--GZWDer (talk) 23:45, 18 July 2018 (UTC)[reply]
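Excluding cebwiki-only items from a result set, as suggested here, amounts to a simple post-filter. A sketch under the assumption that each item arrives with the list of site codes of its sitelinks (the data shape is illustrative, not any particular API's output):

```python
def exclude_cebwiki_only(items):
    """Drop items whose only sitelink points to cebwiki.

    items: mapping of item ID -> list of sitelink site codes.
    Returns the IDs a consumer might keep if they distrust
    entries that exist solely because of cebwiki.
    """
    return [
        qid for qid, sites in sorted(items.items())
        if set(sites) != {"cebwiki"}
    ]


items = {
    "Q1": ["enwiki", "cebwiki"],  # also on enwiki: keep
    "Q2": ["cebwiki"],            # cebwiki-only: drop
    "Q3": [],                     # no sitelinks at all: keep
}
print(exclude_cebwiki_only(items))  # -> ['Q1', 'Q3']
```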
- Oppose Thanks a lot for your work on other wikis, it is immensely useful, but this workflow is really not appropriate for cebwiki. Creating new cebwiki items without being certain that they do not duplicate existing items creates a significant strain on the community. It is not okay to expect people to find ways to exclude cebwiki-only items in query results as a result: these items should not be created in the first place. − Pintoch (talk) 09:55, 19 July 2018 (UTC)[reply]
- Probably 90% of entries are unique to cebwiki. It may be wise to import these unique entries first.--GZWDer (talk) 16:38, 20 July 2018 (UTC)[reply]
- Well, whatever the actual percentage is, many of us have painfully experienced that it is way too low for our standards. It may be wise to be more considerate to your fellow contributors, and stop hammering the server too. A lot of people have complained about cebwiki item creations, and it is really a shame that a block was necessary to actually get you to stop. So I really stand by my oppose. − Pintoch (talk) 07:34, 21 July 2018 (UTC)[reply]
- The approach outlined above doesn't really address any of the problems with the data.
--- Jura 04:41, 26 July 2018 (UTC)[reply]
Plan 2
The plan only does:
- Create items from w:ceb:Kategoriya:Articles without Wikidata item (plan to do together with step 2)
- Import GeoNames ID (P1566) for pages
Therefore:
- It is easier to find articles that exist in other Wikipedias via search and projectmerge (and possibly mix'n'match and other tools)
- It is also possible to find entries from their GeoNames IDs, and vice versa
- As no other data will be imported in plan 2, it will not pollute query results or OpenRefine (unless one specifically queries GeoNames IDs)
- Others may still import other data to these items, but only if they're confident in doing so; they had better import coordinates etc. from a more reliable database (e.g. GEOnet Names Server)
--GZWDer (talk) 06:09, 26 July 2018 (UTC)[reply]
- Oppose I only oppose your *cebwiki* importing; you are free to import Special:unconnectedpages links from wikis other than this one. --Liuxinyu970226 (talk) 04:45, 31 July 2018 (UTC)[reply]
- @Pasleim: seems to have done quite a lot of maintenance on cebwiki sitelinks. I'm curious what his view is on this.
--- Jura 06:39, 31 July 2018 (UTC)[reply]
- Oppose, this still pollutes OpenRefine results - especially when reconciling via GeoNames ID, which should be the preferred way when this id is available in the table. I don't see how voluntarily keeping the items basically blank would be a solution at all, it makes it harder to find duplicates. − Pintoch (talk) 11:54, 5 August 2018 (UTC)[reply]
- Do you have experience with matching based on existing GeoNames IDs then? I still see items on a regular basis which have the wrong ID thanks to bots which imported lots of bad matches years ago (e.g. Weschnitz (Q525148) and River Coquet (Q7337301)), so it would be great if you could explain what you did to avoid mismatches so that bots can do the same. If bots assume that our GeoNames IDs are correct, they'll add sitelinks/statements/descriptions/etc to the wrong items and make a mess that's much harder to clean up than duplicates are. - Nikki (talk) 20:09, 5 August 2018 (UTC)[reply]
- @Pintoch: Wikidata QIDs are designated as persistent identifiers; they remain valid when items are merged, but there is no guarantee that any item (whether bot-created or not) will never be merged or redirected. There are plenty of mismatches between cebwiki and Wikidata (which should be solved), but creating new items will not bring any new mismatches. Also, why do you think that leaving cebwiki pages unconnected makes it easier to find duplicates?--GZWDer (talk) 09:28, 6 August 2018 (UTC)[reply]
- @Nikki: Yes I have experience with matching based on GeoNames IDs, and it generally gives very bad results because many items get matched to cebwiki items instead of the canonical item. I don't have any good strategy to avoid mismatches and that is the reason why I regret that these cebwiki items have been created without the appropriate checks for existing duplicates. I understand that cebwiki imports are not the only imports responsible for the unreliability of GeoNames ids in Wikidata, but in my experience the majority of errors came from cebwiki. I am not sure I fully get your point: are you arguing that it is fine to create duplicate cebwiki items because GeoNames IDs in Wikidata are already unreliable? I don't see how existing errors are an excuse for creating more of them. − Pintoch (talk) 09:02, 12 August 2018 (UTC)[reply]
- @Pintoch: I am arguing that we need to avoid linking the cebwiki pages to the wrong items because merges are vastly better than splits, and that will involve some duplicates. Duplicate IDs continue being valid and will point to the right item even after a merge. The same is not true of splitting and you never know who is already using the ID. I agree that it would be nice to reduce the number of duplicates it creates, but nobody seems to have any idea how it should do that without creating even more bad matches, which is why I was hoping you might have some tips. - Nikki (talk) 13:12, 12 August 2018 (UTC)[reply]
- @Nikki: okay, I get your point, thanks. So, no I haven't looked into the problem myself. If I had time I would first try to clean up the current items rather than creating new ones (and you have worked on this: thanks again!). I don't think there is any rush to empty w:ceb:Kategoriya:Articles without Wikidata item, so that's why I oppose this bot request. − Pintoch (talk) 18:24, 12 August 2018 (UTC)[reply]
- @GZWDer: creating new items will not bring any new mismatches: creating new items will create new duplicates, and that is what disrupts our workflows. I personally don't care about the Wikidata <-> cebwiki mapping. If you care about this mapping, then please improve it without creating duplicates (that is, with reliable heuristics to match the cebwiki articles to existing items). If you do not have the tools to do this import without being disruptive to other Wikidata users, then don't do it. If someone else files a bot request to do this task, with convincing evidence that their import process is more reliable than yours, I will happily support it. − Pintoch (talk) 09:02, 12 August 2018 (UTC)[reply]
- @Pintoch: Your argument is basically "creating new duplicates is harmful in any case", but duplicates already exist everywhere, created by different users. They may eventually be merged, and their IDs remain valid. There are many more cases where no match is found, and no items will be created for those in the foreseeable future (as it is not possible to handle all 500,000 pages manually).
- @GZWDer: there are three differences between other users' duplicates and your duplicates: the first is the scale (500,000 items for this proposal), the second is the absence of any satisfactory checks for existing duplicates (which is unacceptable), the third is the domain (geographical locations are pivotal items that many other domains rely on - creating a mess there is more disruptive than in other areas). This is about creating 500,000 new geographical items with no reconciliation heuristics to check for existing duplicates. This is really detrimental to the project, and I am not the only one complaining about it. − Pintoch (talk) 10:31, 19 August 2018 (UTC)[reply]
- Also, what about first creating items only for pages without existing items with the same labels (this is the default setting of PetScan)?--GZWDer (talk) 20:12, 13 August 2018 (UTC)[reply]
- I think checks need to be more thorough than that, for instance because cebwiki article titles often include disambiguation information in brackets. These heuristics would fail to match https://ceb.wikipedia.org/wiki/Amsterdam_(lungsod_sa_Estados_Unidos,_Montana) to Amsterdam-Churchill (Q614935), for example. − Pintoch (talk) 10:31, 19 August 2018 (UTC)[reply]
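Even a heuristic that strips such parenthetical disambiguators before comparing labels would not catch the Amsterdam example above, since the stripped cebwiki title still differs from the existing item's label. A sketch:

```python
import re

def normalized_title(title):
    """Drop a trailing parenthetical disambiguator from a page title,
    as carried by many cebwiki article titles."""
    return re.sub(r"\s*\([^)]*\)\s*$", "", title)


ceb = normalized_title("Amsterdam (lungsod sa Estados Unidos, Montana)")
print(ceb)                           # -> Amsterdam
print(ceb == "Amsterdam-Churchill")  # -> False: still no label match
```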
- Oppose. Although I'm not aware of this being a policy so far, I believe new items should be created from the encyclopedia that is likely to have the best information on them. A bot shouldn't create new items from a Russian Wikipedia article about a US state or a US politician, and a bot shouldn't create new items about a Russian city or politician from an English Wikipedia article. This restriction wouldn't necessarily apply to items that are not firmly connected to any particular country, such as algebra for example. Jc3s5h (talk) 16:18, 30 August 2018 (UTC)[reply]
- No, this isn't a policy and it never could be. One of Wikidata's main functions is to support other Wikimedia projects by providing interwiki links and structured data. Requiring links to a particular Wikipedia before an item is considered notable would cripple Wikidata. We also can't control which Wikipedias people copy data from. We can refuse to allow a bot to run but that doesn't stop people from doing it manually or with tools like Petscan and Harvest Templates. - Nikki (talk) 12:08, 31 August 2018 (UTC)[reply]
- @Ivan_A._Krestinin: In the meantime, KrBot seems to be doing this. --- Jura 10:28, 11 September 2018 (UTC)[reply]
- I have no time to read the discussion. My bot is importing country (P17), coordinate location (P625), and GeoNames ID (P1566) from cebwiki now. — Ivan A. Krestinin (talk) 21:24, 11 September 2018 (UTC)[reply]
- @Ivan_A._Krestinin: There is a lot of opposition to mass-creating new items for cebwiki items (see above), so you should create a new request for permissions before continuing. - Nikki (talk) 12:05, 12 September 2018 (UTC)[reply]
- Ok, I disabled new item creation. I have code for connecting pages from different wikis, but it does not work without item creation, because it is based on this scheme: import data, find duplicate items, analyze data conflicts, labels, etc., then merge items. — Ivan A. Krestinin (talk) 20:07, 12 September 2018 (UTC)[reply]
- Thanks. The main issue is that people don't want duplicates. If you can explain what your bot does to avoid duplicates when you create a new request for permissions, it will hopefully be enough to change people's minds. :) - Nikki (talk) 09:00, 13 September 2018 (UTC)[reply]
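In outline, the import-then-merge scheme Ivan describes might look like the following sketch: group imported items by GeoNames ID, then flag a pair for merging only when its coordinates agree within a tolerance, leaving conflicting pairs for manual review. The tuple shapes and the tolerance are illustrative assumptions, not his bot's actual code.

```python
def merge_candidates(items, max_coord_delta=0.01):
    """Find mergeable duplicate pairs among imported items.

    items: iterable of (qid, geonames_id, (lat, lon)) tuples.
    Returns (keep_qid, merge_qid) pairs whose GeoNames IDs match and
    whose coordinates agree within max_coord_delta degrees; items with
    the same ID but conflicting coordinates are left untouched.
    """
    by_id = {}
    for qid, gid, coord in items:
        by_id.setdefault(gid, []).append((qid, coord))
    merges = []
    for group in by_id.values():
        first_qid, first_coord = group[0]
        for qid, coord in group[1:]:
            if (abs(coord[0] - first_coord[0]) <= max_coord_delta
                    and abs(coord[1] - first_coord[1]) <= max_coord_delta):
                merges.append((first_qid, qid))
    return merges


items = [
    ("Q1", "g1", (10.0, 20.0)),
    ("Q2", "g1", (10.001, 20.001)),  # same ID, coordinates agree: merge
    ("Q3", "g1", (50.0, 60.0)),      # same ID, conflicting coords: review
    ("Q4", "g2", (0.0, 0.0)),        # unique ID: nothing to do
]
print(merge_candidates(items))  # -> [('Q1', 'Q2')]
```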
If someone is creating items for all cebwiki articles, I still plan to add statements and descriptions to them. However, for real-life reasons I'd like to place the request On hold until January-February 2019 and see what happens. Comments and questions are still welcome, but I am probably not able to answer them anytime soon.--GZWDer (talk) 06:10, 12 September 2018 (UTC)[reply]
- @GZWDer: Since there are too many oppose comments, and privacy concerns have already been raised with WMF Trust & Safety, it's unlikely that your work can be approved, so why not withdraw it? --Liuxinyu970226 (talk) 22:41, 15 September 2018 (UTC)[reply]
- Note I still have an interest: even if data quality is poor and duplicates will happen, making them happen earlier is better than later. In Wikidata, we may just say something may exist, and later people may improve it using better sources.--GZWDer (talk) 01:37, 22 July 2020 (UTC)[reply]
- @Marsupium, Pintoch, Jc3s5h: Let's discuss the more general aspects of mass import at Wikidata:Requests for permissions/Bot/RegularBot 2 - I am planning to use that bot account to perform mass imports.--GZWDer (talk) 08:23, 8 August 2020 (UTC)[reply]
- @GZWDer: can you mark this one as withdrawn then? − Pintoch (talk) 10:12, 8 August 2020 (UTC)[reply]