Wikidata:Requests for permissions/Bot
To request a bot flag, or approval for a new task, in accordance with the bot approval process, please input your bot's name into the box below, followed by the task number if your bot is already approved for other tasks. Then transclude that page onto this page, like this: {{Wikidata:Requests for permissions/Bot/RobotName}}.

Old requests go to the archive.

Once consensus is obtained in favor of granting the botflag, please post requests at the bureaucrats' noticeboard.

Bot Name | Request created | Last editor | Last edited
Elhuyar Fundazioa bot 2 | 2019-08-22, 09:11:14 | Pamputt | 2019-09-30, 20:45:04
TedNye | 2019-09-18, 03:16:13 | Tednye | 2019-09-30, 10:26:17
LinkedPipes ETL Bot | 2019-09-10, 10:01:05 | Lymantria | 2019-09-18, 05:36:03
TidoniBot | 2019-08-30, 20:07:51 | Jc3s5h | 2019-09-27, 20:39:14
Antoine2711bot | 2019-07-02, 04:25:58 | Peter James | 2019-09-09, 18:29:59
CoRepoBot | 2019-03-18, 16:22:12 | Jura1 | 2019-07-04, 14:26:39
Niko.georgievbot | 2019-05-10, 08:15:13 | Lymantria | 2019-09-10, 05:29:28
PodoBot | 2019-03-25, 01:57:40 | Mbch331 | 2019-09-21, 20:24:43
Souedbot | 2019-06-11, 12:36:01 | Ymblanter | 2019-07-10, 18:54:21
EpiskoBot_2 | 2019-06-26, 18:56:25 | Looperz | 2019-07-12, 13:11:21
EbeBot | 2019-04-06, 20:39:24 | Yair rand | 2019-05-13, 04:55:18
NMBot | 2019-03-23, 20:33:46 | Jura1 | 2019-05-12, 19:43:16
SixTwoEightBot | 2019-03-18, 00:14:07 | Ymblanter | 2019-10-10, 19:51:45
GZWDer (flood) 5 | 2019-01-19, 17:29:58 | Pamputt | 2019-02-17, 10:22:30
SmhiSwbBot | 2018-12-19, 09:02:48 | Ymblanter | 2019-01-20, 20:47:11
DBDataPublisherBot | 2018-12-08, 12:54:13 | Vogone | 2019-03-02, 22:17:45
LauBot | 2018-12-01, 14:08:55 | Lymantria | 2018-12-16, 07:48:36
JonHaraldSøbyWMNO-bot | 2018-10-25, 13:00:07 | Jon Harald Søby (WMNO) | 2019-10-17, 22:46:00
MewBot | 2018-09-22, 09:38:20 | Pamputt | 2018-10-30, 21:58:48
ZbmathAuthorID | 2018-08-27, 16:09:16 | GZWDer | 2019-07-04, 13:00:47
ScorumMEBot 2 | 2018-08-06, 14:39:27 | Lymantria | 2018-09-01, 06:04:00
GZWDer (flood) 3 | 2018-07-23, 23:08:28 | Jura1 | 2019-09-29, 11:16:34
GZWDer (flood) 2 | 2018-07-16, 13:56:24 | Liuxinyu970226 | 2018-09-15, 22:41:50
Crossref bot | 2018-04-19, 21:12:41 | GZWDer | 2019-07-04, 12:52:14
WikiBot | 2018-06-17, 15:10:10 | Matěj Suchánek | 2018-08-03, 09:09:13
PricezaBot | 2018-06-14, 09:18:09 | Praxidicae | 2018-06-14, 19:29:22
Schieboutct | 2018-04-22, 01:39:47 | GZWDer | 2019-07-04, 12:58:26
Wikidata get | 2018-06-15, 10:51:58 | GZWDer | 2019-07-04, 13:00:09
Wolfgang8741 bot | 2018-06-18, 02:17:10 | Wolfgang8741 | 2018-09-05, 15:51:10
CanaryBot 2 | 2018-05-10, 23:46:00 | Ivanhercaz | 2018-05-14, 18:26:33
Maria research bot | 2018-03-13, 06:15:42 | GZWDer | 2019-07-04, 12:55:37
AmpersandBot 2 | 2018-02-22, 01:43:22 | Jura1 | 2018-03-12, 10:18:09
Arasaacbot | 2018-01-15, 12:28:44 | Matěj Suchánek | 2018-08-08, 11:24:07
Taiwan democracy common bot | 2018-02-09, 07:09:27 | GZWDer | 2019-07-04, 13:01:33
Newswirebot | 2018-02-08, 13:00:18 | Dhx1 | 2018-09-23, 11:53:12
KlosseBot | 2017-11-17, 20:40:22 | Matěj Suchánek | 2018-08-03, 09:19:57
NIOSH bot | 2017-11-14, 05:59:08 | Ymblanter | 2018-08-26, 20:33:45
Neonionbot | 2017-10-19, 06:15:18 | GZWDer | 2019-07-04, 12:56:13
Handelsregister | 2017-10-16, 07:39:42 | Pasleim | 2018-02-09, 08:46:30
Jntent's Bot | 2017-06-30, 23:37:11 | Matěj Suchánek | 2018-08-03, 09:21:28
WikiProjectFranceBot | 2017-05-08, 20:01:48 | Lymantria | 2018-05-31, 13:51:32
Jefft0Bot | 2017-04-17, 15:16:29 | Matěj Suchánek | 2018-08-03, 09:18:22
MexBot 2 | 2017-06-08, 03:00:53 | ValterVB | 2017-06-25, 14:32:26
ZacheBot | 2017-03-04, 23:29:38 | Zache | 2017-07-11, 11:13:15
YULbot | 2017-02-21, 18:05:13 | YULdigitalpreservation | 2018-03-06, 13:15:37
YBot | 2017-01-12, 16:43:19 | Pasleim | 2018-06-03, 17:52:12

Elhuyar Fundazioa bot 2

Elhuyar Fundazioa bot
Operator: Elhuyar Fundazioa


Automatically adding Basque lexemes with forms and definitions.


Function details: The work of this bot has been planned together with the Association of Basque Wikipedians. The bot will upload about 7,000 lexemes (all nouns) with their corresponding forms, 65 for each lexeme. It will also provide definitions; each meaning of a lexeme corresponds to a 'Sense'.

All the lexemes and definitions belong to the Elhuyar Ikaslearen Hiztegia (Elhuyar Student Dictionary), published in 2008, ISBN: 978-84-95338-96-9.

The property of Wikidata is Elhuyar Dictionary ID (

  Info You write that your bot will import definitions (senses) from the Elhuyar Ikaslearen Hiztegia. So far, there is a copyright on this book. Does this mean that the Elhuyar Fundazioa plans to release this book and all the data under the CC-0 licence? Pamputt (talk) 15:01, 25 September 2019 (UTC)


TedNye
Operator: Tednye

I am studying migration patterns of people from Europe to USA. I want to query Wikidata using PHP.


Function details: --Tednye (talk) 03:16, 18 September 2019 (UTC)

For queries no bot flag is needed. That's for (mass) edits. Lymantria (talk) 10:49, 18 September 2019 (UTC)

My file_get_contents calls are denied because I don't have sufficient permissions. What is the remedy for this?

What's the URL you're trying to retrieve? Mbch331 (talk) 20:19, 21 September 2019 (UTC)

Using PHP, file_get_contents('') will not return the page. I can extract the page using a tool, but I want to do it programmatically.

What is the IP-address of the server you are using? I tested it and I can get the content of the page. Mbch331 (talk) 07:29, 22 September 2019 (UTC)

My home IP address is:
The goDaddy servers are,

My usage will be medium, less than 100,000 queries per month. The call produces this error: SSL operation failed with code 1. OpenSSL Error messages: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed in $buffer = file_get_contents('');

Can someone guide me here? Is it my server's IP address that is the problem?
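
A note for readers who hit the same error: "certificate verify failed" generally means the client could not validate the server's TLS certificate (often because the CA bundle on the host is missing or outdated), which is a different thing from an IP block. The sketch below uses Python rather than PHP and shows a read-only query against the Wikidata Query Service with certificate verification left enabled and a descriptive User-Agent. The function names, the User-Agent string and the example query are illustrative assumptions and do not come from this thread.

```python
import json
import ssl
import urllib.parse
import urllib.request

# Public SPARQL endpoint of the Wikidata Query Service.
ENDPOINT = "https://query.wikidata.org/sparql"
# Hypothetical User-Agent; replace with your own tool name and contact address.
USER_AGENT = "ExampleMigrationStudy/0.1 (contact: you@example.org)"


def build_request(sparql: str) -> urllib.request.Request:
    """Build a GET request for a SPARQL query without sending it."""
    params = urllib.parse.urlencode({"query": sparql, "format": "json"})
    return urllib.request.Request(
        f"{ENDPOINT}?{params}",
        headers={
            "User-Agent": USER_AGENT,
            "Accept": "application/sparql-results+json",
        },
    )


def run_query(sparql: str) -> dict:
    """Send the query; the server certificate is checked against the system CA store."""
    context = ssl.create_default_context()  # verification stays enabled
    request = build_request(sparql)
    with urllib.request.urlopen(request, context=context, timeout=60) as response:
        return json.load(response)


# Illustrative query only: lists five items that are instances of human (Q5).
EXAMPLE_QUERY = "SELECT ?person WHERE { ?person wdt:P31 wd:Q5 } LIMIT 5"
```

Calling run_query(EXAMPLE_QUERY) needs network access. If it fails with the same verification error, updating the host's CA certificates is usually a better fix than disabling verification.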

LinkedPipes ETL Bot

LinkedPipes ETL Bot
Operator: Jakub Klímek

Task/s: Repeatedly import data about Czech Veteran Trees to Wikidata from an authoritative source

Code: The component developed (GitHub) uses Wikidata Toolkit. The component can be used in various pipelines for various tasks loading data into Wikidata.

Function details:

  • The component itself can be used for generic editing of Wikidata through the Wikidata Toolkit. It is developed in scope of the Wikidata & ETL project grant and was demoed at a Wikimania 2019 workshop and poster.
  • In the current task, it will update data about Czech Veteran Trees from the authoritative source by means of an ETL pipeline using the component.
  • This means that missing trees will be added and information about existing trees will be updated whenever it has changed in the source.
  • The pipeline will be scheduled to run periodically (probably monthly).
  • More tasks (data source) will be added as separate requests in future.
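
The add-or-update behaviour described in the list above can be sketched as a planning step that compares the source with what is already on Wikidata before any edit is made. This is an illustration in Python only, not the LinkedPipes ETL component (which uses Wikidata Toolkit); the record layout, the `height_m` field and the `tree-N` keys are hypothetical.

```python
from typing import Dict, List, Tuple

# Each record maps a field name to a value; dictionary keys are source identifiers.
Record = Dict[str, str]


def plan_sync(
    source: Dict[str, Record], wikidata: Dict[str, Record]
) -> Tuple[List[str], Dict[str, Record]]:
    """Return (ids to create, {id: changed fields}) without performing any edit."""
    to_create = [key for key in source if key not in wikidata]
    to_update: Dict[str, Record] = {}
    for key, record in source.items():
        if key not in wikidata:
            continue
        # Keep only the fields whose value differs from what is already stored.
        changed = {
            field: value
            for field, value in record.items()
            if wikidata[key].get(field) != value
        }
        if changed:
            to_update[key] = changed
    return to_create, to_update


# Hypothetical data: tree-2 changed in the source, tree-3 is missing on Wikidata.
source = {
    "tree-1": {"height_m": "25"},
    "tree-2": {"height_m": "31"},
    "tree-3": {"height_m": "18"},
}
wikidata = {"tree-1": {"height_m": "25"}, "tree-2": {"height_m": "30"}}
to_create, to_update = plan_sync(source, wikidata)
```

Separating planning from writing makes a periodic run easy to review: unchanged trees produce no edits at all.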

--Jakub Klímek (talk) 10:00, 10 September 2019 (UTC)

Please, perform some test edits (outside sandbox). Lymantria (talk) 05:35, 18 September 2019 (UTC)


TidoniBot
Operator: Tidoni

Task/s: Import birth dates and death dates from identifiers already available on the item, for example Library of Congress authority ID (P244) or Elite Prospects player ID (P2481).


Function details: Check all entries containing P2481, check the corresponding website and, if available, add the birth date or death date.

Reliable Sources:

  • Library of Congress authority ID (P244)
  • IAAF athlete ID (P1146)

Unreliable Sources:

  • Find A Grave memorial ID (P535)

Reliability to be discussed:

  • Elite Prospects player ID (P2481)
  • IMDb ID (P345)
  • Soccerway player ID (P2369)
  • player ID (P2020)

Examples: [1] [2] [3]

--Tidoni (talk) 20:07, 30 August 2019 (UTC)

  Info Please note Wikidata:Administrators' noticeboard#Unapproved TidoniBot adding erroneous dates. --Succu (talk) 21:24, 30 August 2019 (UTC)
Considering the bot's proven history of being unable to determine whether the Gregorian or Julian calendar is used, I suggest the bot be forbidden from adding any date before 15 February 1923, the date the last country, Greece, switched from the Julian to the Gregorian calendar. (Other countries switched later, but from calendars that are very unlikely to be mistaken for Gregorian, such as Islamic calendars.) I also suggest the quality of each source be considered, and the approval be only for sources specifically approved in this request.
A further condition should be manual reversion of all edits already made by the bot that cannot be substantiated with reliable sources. Jc3s5h (talk) 22:29, 30 August 2019 (UTC)
w:Wikipedia:Reliable sources/Perennial sources contains a list of sources which have been extensively discussed at the English Wikipedia with respect to reliability. There is an entry for IMDb; the summary result, in the form of an icon, is that the source is generally unreliable. The main discussion is located at w:Wikipedia:Reliable sources/Noticeboard/Archive 267#RfC: IMDb. A point I consider particularly significant is that IMDb has been found to contain material copied from Wikipedia; using it would create circular referencing. Jc3s5h (talk) 12:46, 31 August 2019 (UTC)
Comment. I unblocked the bot so that it can now perform test edits.--Ymblanter (talk) 18:48, 31 August 2019 (UTC)
@Jc3s5h: Did the test edits yield any comments from your side? In particular, does the Julian/Gregorian problem seem to be tackled correctly? Lymantria (talk) 06:22, 11 September 2019 (UTC)
I only reviewed edits until I found an error; I did not attempt to find all the errors in the test edits. In the edits to Alexander Borodin (Q164004) the unreliable source Find A Grave is used to assert that Borodin was born 13 October 1833 Gregorian and died 15 February 1887 Gregorian. But according to Encyclopedia Britannica these are Julian calendar dates. The quotation from Britannica is "Aleksandr Borodin, in full Aleksandr Porfiryevich Borodin, (born Oct. 31 [Nov. 12, New Style], 1833, St. Petersburg, Russia—died Feb. 15 [Feb. 27], 1887, St. Petersburg)".
Therefore the bot fails on two counts: using an unreliable source and misinterpreting the date contained in the source. Jc3s5h (talk) 16:54, 11 September 2019 (UTC)
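
The Old Style/New Style pairs in the Britannica quotation can be checked mechanically. The following is a minimal sketch in Python, assuming the usual integer Julian Day Number formula for Julian-calendar dates; the function name is illustrative and this is not code from the bot.

```python
from datetime import date


def julian_to_gregorian(year: int, month: int, day: int) -> date:
    """Convert a Julian-calendar date (AD years) to the Gregorian date."""
    # Julian Day Number of the Julian-calendar date (standard integer formula).
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    jdn = day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083
    # Python ordinals start at Gregorian 0001-01-01 = 1, which is JDN 1721426.
    return date.fromordinal(jdn - 1721425)


born = julian_to_gregorian(1833, 10, 31)
died = julian_to_gregorian(1887, 2, 15)
print(born.isoformat(), died.isoformat())  # 1833-11-12 1887-02-27
```

Both results match the New Style dates given in brackets by Britannica, which shows the 12-day offset that applies to 19th-century dates.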
  • I think the approach discussed with Andrew Gray in the now archived AN discussion should work. We would have four types of dates: (1) safe ones (from when all had Gregorian), (2) safe ones (from before Gregorian), (3) assumed Gregorian (between the two), except for: (4) assumed Julian (date before the conversion to Gregorian in a country likely relevant for the person). --- Jura 18:31, 11 September 2019 (UTC)
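
One possible reading of the four groups in the comment above, sketched in Python. The 1923 cutoff is the one proposed earlier in this discussion and 15 October 1582 is the first day of the Gregorian calendar. The country table is a small illustrative assumption, not a vetted data set, and the names are hypothetical.

```python
from enum import Enum
from typing import Dict, Optional, Tuple

YMD = Tuple[int, int, int]  # nominal (year, month, day) exactly as written in the source

FIRST_GREGORIAN: YMD = (1582, 10, 15)  # first day of the Gregorian calendar
ALL_GREGORIAN: YMD = (1923, 2, 15)     # cutoff proposed in this discussion

# Illustrative adoption dates only; a real bot would need a reviewed table.
ADOPTION: Dict[str, YMD] = {"RU": (1918, 2, 14), "GB": (1752, 9, 14)}


class Calendar(Enum):
    SAFE_GREGORIAN = 1     # (1) everyone was already using Gregorian
    SAFE_JULIAN = 2        # (2) Gregorian did not exist yet
    ASSUMED_GREGORIAN = 3  # (3) in between, no evidence for Julian
    ASSUMED_JULIAN = 4     # (4) in between, before the relevant country switched


def classify(ymd: YMD, country: Optional[str] = None) -> Calendar:
    """Classify a nominal date; tuples are compared as written, without conversion."""
    if ymd >= ALL_GREGORIAN:
        return Calendar.SAFE_GREGORIAN
    if ymd < FIRST_GREGORIAN:
        return Calendar.SAFE_JULIAN
    adoption = ADOPTION.get(country) if country else None
    if adoption is not None and ymd < adoption:
        return Calendar.ASSUMED_JULIAN
    return Calendar.ASSUMED_GREGORIAN
```

For example, classify((1833, 10, 31), "RU") returns Calendar.ASSUMED_JULIAN, while the same date without a country falls into the assumed-Gregorian group, which is the case the Borodin example warns about.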
One criterion I use in evaluating bot behavior is whether the bot does exactly what the request for approval says it will do. The current version of the request for approval acknowledges Find A Grave as an unreliable source. The bot added statements from that source anyway. Therefore the bot is defective. The fact that the edits were erroneous just makes it more badly broken. Jc3s5h (talk) 19:03, 11 September 2019 (UTC)
The source is available for everyone to evaluate and clearly indicated .. The edit as such seems to copy the data accurately. --- Jura 19:09, 11 September 2019 (UTC)
The bot adds false information to the dates stated in Find A Grave. The source gives dates with no explicit calendar; implicitly, since the events occurred in St. Petersburg, Russia, during a period when the Julian calendar was in effect, the best interpretation of the source is that they are Julian calendar dates. We do not copy sources, we read sources. We understand the context of statements in a source and interpret the statements in context. A bot should be confined to a narrow domain where mere copying will result in correct statements. Since this bot is not so confined, approval should be denied.
@Tidoni: Please, comment. Lymantria (talk) 05:27, 12 September 2019 (UTC)
The bot in its current state only adds dates after 1923, so only Gregorian dates are added. So from the technical side, this should work now. I am not sure if I should start adding dates from Find A Grave memorial ID after 1923 or if these edits are not wanted because of the unreliability of the source. In my understanding every Information should be added but if a better on is available, it should be marked as deprecated. --Tidoni (talk) 11:38, 12 September 2019 (UTC)
I think "In my understanding every Information should be added but if a better on is available, it should be marked as deprecated" is completely wrong. It should not be a goal to find and add every source that verifies a statement, only enough good sources to be confident the statement is correct. There could also be merit in adding a free on-line source when a good book is already cited, so people will not have to go to the library to verify the statement. Marking statements and sources as deprecated would only be appropriate if an erroneous statement is a wide-spread misnomer that needs to be publicly repudiated. Jc3s5h (talk) 17:24, 12 September 2019 (UTC)
@Tidoni: Please perform a second batch of test edits, showing no dates before 15 February 1923 (unlike this one) and sticking to what you have mentioned as reliable sources. Lymantria (talk) 05:40, 18 September 2019 (UTC)
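
The two conditions requested for the second test run (no dates before 15 February 1923, and only the sources listed as reliable in this request) amount to a simple pre-edit filter. The sketch below is a hypothetical illustration in Python; the `Candidate` structure and the item labels are assumptions, not the bot's actual code.

```python
from dataclasses import dataclass
from typing import List, Tuple

CUTOFF = (1923, 2, 15)                # earliest nominal date allowed
APPROVED_SOURCES = {"P244", "P1146"}  # sources listed as reliable in this request


@dataclass
class Candidate:
    item: str                  # placeholder label, not a real Q-ID
    source: str                # identifier property the date was read from
    ymd: Tuple[int, int, int]  # (year, month, day) as stated by the source


def allowed(candidate: Candidate) -> bool:
    """True only for approved sources and dates on or after the cutoff."""
    return candidate.source in APPROVED_SOURCES and candidate.ymd >= CUTOFF


candidates: List[Candidate] = [
    Candidate("item-A", "P244", (1950, 6, 1)),    # approved source, late date
    Candidate("item-B", "P244", (1887, 2, 15)),   # approved source, too early
    Candidate("item-C", "P535", (1960, 3, 9)),    # Find A Grave, not approved
]
to_edit = [c.item for c in candidates if allowed(c)]
```

Only item-A passes the filter; the other two would be skipped and could be written to a log for manual review.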
  • Personally, I think we should include it if there is no other day-precision data available. As for dates before 1923, I'd apply the three groups outlined above. Gregorian calendar start date (P7295) is still work in progress, but it should allow to do some checks. BTW, I don't think any source has a guaranteed reliability. We wouldn't need Wikidata if that was so. --- Jura 07:47, 18 September 2019 (UTC)
@Lymantria: From which of the sources should I try a test run? Library of Congress authority ID and Find A Grave memorial ID were said to be unreliable and shouldn't be added. The other ones weren't discussed. Should I add examples of the other sources? --Tidoni (talk) 09:07, 18 September 2019 (UTC)
@Tidoni: I was not aware that Library of Congress authority ID was considered unreliable? Not above here at least. Neither is IAAF athlete ID. I'd consider doing a test run on these two. Lymantria (talk) 09:14, 18 September 2019 (UTC)
Some of the LOC entries have the same source as Finda.. : Wikipedia. --- Jura 21:07, 19 September 2019 (UTC)
Still, does that make LOC unreliable to extract the birth and death dates from? Lymantria (talk) 09:51, 25 September 2019 (UTC)
I think reliability varies from reference to reference in relation to a statement. Find-a-.. has the advantage that it generally reproduces the primary source used, something that is obviously superior to the use of secondary or tertiary references. Obviously, even a primary source can be wrong, but we have ranks to indicate that. From my personal experience, I came across more incorrect or doubtful statements from BdF than the other two, but this is probably due to the high number we have from them. None listed the primary or secondary reference used. BTW none of the statements about the reliability above have references and as such could be seen as defamatory. I think we'd better refrain from making such generalizing claims. --- Jura 10:28, 26 September 2019 (UTC)
Articles in Wikipedia and items in Wikidata require reliable sources. Discussion pages do not require reliable sources. Discussion of the reliability of sources is essential. Any publication that indiscriminately uses Wikipedia or Wikidata as a source is itself unreliable. The only situation where I would accept the use of Wikipedia or Wikidata as a source in an outside publication is if the author is an expert in the field, references a specific version of a Wikimedia article or item, and explains why it's correct in that specific situation. Jc3s5h (talk) 20:31, 27 September 2019 (UTC)
Ideally, a primary source that is mentioned in Find a Grave should be cited directly, perhaps being described as "as quoted in" the Find a Grave item. For birth and death dates, primary and secondary sources both have their place. For older items, it may be necessary to browse through the primary source from a time when dates were certainly Julian, locate the discontinuity associated with the changeover, and also take note of the date the year is incremented. Some sources will not be extensive or consistent enough to do this (maybe each gravestone carver does his own thing).
A good secondary source will have worked through all the calendar confusion and make it crystal clear what calendar system is being used to state the dates. Jc3s5h (talk) 20:39, 27 September 2019 (UTC)


antoine2711bot
Operator: Antoine2711

Task/s: This robot will add data in the context of Q62382524.

Code: It's work done with OpenRefine, maybe a bit of QuickStatements, and maybe some API calls from Google Sheets.

Function details: Transfer data for 280 movies from an association of distributors.

--Antoine2711 (talk) 04:25, 2 July 2019 (UTC)

@Antoine2711: Is your request still supposed to be active? Do you have test/example edits? Lymantria (talk) 07:18, 17 August 2019 (UTC)
  • @Antoine2711, Lymantria: It seems to be unfinished, many items created are somewhat empty and not used by any other item: [4]. @‎Nomen ad hoc: had listed one of them for deletion. If the others aren't used or completed either, I think we should delete them. Other than that: lots of good additions. --- Jura 12:57, 8 September 2019 (UTC)
  • @Lymantria, Jura: The data I added all comes from a clean data set provided by distributors. I tried to do my best, but I might not have done everything perfectly. Someone spotted an empty item, and I added the missing data. If there are any others, I will make the same corrections.
My request for the bot is still pertinent as I will do other additions. What information do I need to provide for my bot request? --Antoine2711 (talk) 16:44, 8 September 2019 (UTC)
@Antoine2711: What will happen with items like Ronald Fahm (Q65116570), Ron Ladd (Q65116569) Romain Malbosc (Q65116567), Romain Lacourbas (Q65116566), Roland Bréard (Q65116565)? Currently they have no identifier, hardly any statement, no references, and no incoming links. --- Jura 17:10, 8 September 2019 (UTC)
There's a deletion request for one of these items at Wikidata:Requests for deletions#Q65119761. I've mentioned a likely identifier for that one. Instead of creating empty items it would be better to find identifiers and links between items before creating them. For example Peter James (Q65115398) could be any of 50 people listed on IMDB - possibly nm6530075, the actor in Nuts, Nothing and Nobody (Q65055294)/tt3763316 but the items haven't been linked and they are possibly not notable enough for Wikidata. Other names in the credits there include Élise de Blois (Q65115717) (probably the same person as the Wikidata item) and Frédéric Lavigne (Q65115798) (possibly the same one but I'm not certain) and several with no item so I'm not sure if this is the source of these names. With less common names there could be one item that is then assumed to be another person with the same name. Peter James (talk) 18:29, 9 September 2019 (UTC)


CoRepoBot
Operator: Rubenwolff

Task/s: Insertion and Updating of meta data collected from all online companies. Data is taken from the respective homepages or news articles. Columns include : company name, company website, founding date, legal address, government company id, founders, number of employees.


Function details: --Rubenwolff (talk) 16:22, 18 March 2019 (UTC) The Open Company Repository has the mission to become the universal search engine for companies. In the process we have collected not only descriptive text but also structured information like company name, company website, founding date, legal address, government company id, founders, number of employees. We aggregate many sources and cross reference the information filtering out any companies with inconsistencies. For newer companies we also make attempts to directly contact the founders and verify the structured information with them.

Update:

  • For startups we have added funding information.
  • For data validation we have added WHOIS and SSL (EV/OV) certificate information. The WHOIS system seems to be defunct; it is almost entirely full of anonymized addresses. But EV/OV certificates are high-quality information sources for company_name, country, city.
  • Legal address has been surprisingly difficult to parse out of HTML. People are very non-uniform in how they format addresses, even if you just take one country and try to build a regex for that. We are looking into NLP-based solutions now.
  • Because of the address difficulty, the government ID effort was also stalled. Once the SSL crawlers have completed, we will restart the work to integrate with government records, and then we could consider these companies to be of very high data quality.

The bot should be able to continuously update Wikidata with new companies and attributes of those company entities.

Discussion:

  • Thanks for creating this request. I think it would be great to have this data inside Wikidata. Can you specify here how many items you want to create? For how many entries do you have information about the number of employees?
Is this about companies located anywhere or only in the US? Can you correspondingly say more about what's in the government ID field and how you expect that to be modelled in Wikidata?
When it comes to founders, that's interesting information but I would expect that you only have their names, is that accurate? Maybe we can model that with 'unknown value' and 'stated as'.ChristianKl❫ 18:29, 18 March 2019 (UTC)
  1. How many I want to insert: > 1000000 but < 10000000 companies. I would suggest we start with a test set of 100 companies, then insert the confidently non-small companies as defined by >=30 backlinks, >50 employees, which would be 100367 companies. Then I would say we should define additional requirements on the age of small companies before inserting them. There are companies which do great work and have global impact but stay <50 employees and have little online presence. Here I would look at verification of the age of the company, but it warrants further community discussion. Rubenwolff (talk) 11:12, 19 March 2019 (UTC)
  2. Currently the counts for companies which have employee information are as follows (but these numbers change daily: sometimes they go up because the crawlers found new companies, sometimes they go down because we added a new data quality filter). Count for companies with 1-50 employees 2302172, for 51-200 employees 305004, for 201-500 employees 89283, for 501-1000 employees 42372, for 1001-5000 employees 7476, for 5001-10000 employees 8907 and for >=10001 employees 7476. Rubenwolff (talk) 11:12, 19 March 2019 (UTC)
  3. These companies are located anywhere, but my crawlers start in the English web, so it is indeed a lot of US companies. There are 1553420 US companies in comparison to 1195446 non US non empty companies. Rubenwolff (talk) 11:12, 19 March 2019 (UTC)
  4. The government ID field is their registration number in their corresponding country. I have started this work with UK companies because the UK gov provides a nice API. The IDs are unique per country or country+province. For example is named "GTN LTD" with UK gov id "10775593", hence I would give it ID "gb/10775593". For other countries like Germany or the US the gov IDs are only unique per region; for example is named "Geothermie Neubrandenburg GmbH" and has DE gov id "Neubrandenburg HRB 1249". The cool guys at Open Corporates have put a lot of thought into the unification of these IDs, so I will coordinate with them. Open Corporates is also the only copyleft source for many company registries (for example, the German Handelsregister I can only get from them). Rubenwolff (talk) 11:12, 19 March 2019 (UTC)
  5. For the founders I indeed do not have wiki Q IDs. In some cases I have Twitter or LinkedIn accounts, but I doubt this helps. What additional information would be required to disambiguate against existing person Q IDs? Rubenwolff (talk) 11:12, 19 March 2019 (UTC)
It would be worthwhile to try to match the founder with twitter/linkedin accounts given that those are sometimes available on Wikidata. When only the name of the founder is known but not any ID I would advocate to save the data as "Unknown Value" with the qualifier stated as (P1932). ChristianKl❫ 17:28, 21 March 2019 (UTC)
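
The "unknown value" plus stated as (P1932) modelling suggested above could look roughly like the following statement in Wikibase JSON, built here as a Python dictionary. Using founded by (P112) as the main property is an assumption made for this illustration, and the helper name and founder name are hypothetical.

```python
import json

FOUNDED_BY = "P112"   # assumed main property for this illustration
STATED_AS = "P1932"   # qualifier named in the comment above


def founder_claim(name_as_stated: str) -> dict:
    """Build a statement whose value is unknown, keeping the name as a string qualifier."""
    return {
        "type": "statement",
        "rank": "normal",
        # "somevalue" is the snak type Wikibase uses for "unknown value".
        "mainsnak": {"snaktype": "somevalue", "property": FOUNDED_BY},
        "qualifiers": {
            STATED_AS: [
                {
                    "snaktype": "value",
                    "property": STATED_AS,
                    "datavalue": {"value": name_as_stated, "type": "string"},
                }
            ]
        },
    }


claim = founder_claim("Jane Doe")  # hypothetical founder name
print(json.dumps(claim, indent=2))
```

If the founder is later matched to an item, the statement can be replaced by one whose main value points to that item.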

Kopiersperre Jklamo ArthurPSmith S.K. Givegivetake fnielsen rjlabs ChristianKl Vladimir Alexiev User:Pintoch Parikan User:Cardinha00 User:zuphilip MB-one User:Simonmarch User:Jneubert Mathieudu68 User:Kippelboy User:Datawiki30 User:PKM User:RollTide882071 Kristbaum Andber08 Sidpark SilentSpike Susanna Ånäs (Susannaanas)

  Notified participants of WikiProject Companies ChristianKl❫ 11:40, 20 March 2019 (UTC)

@Rubenwolff: are you proposing to only create new items, or also to update existing items about companies? How are you matching your companies to Wikidata? I would suggest to first start with the companies that already exist on Wikidata. What tools are you using for this upload? I think it is important to make the code/workflow open. − Pintoch (talk) 13:24, 20 March 2019 (UTC)
    • Yea I am new to wikidata no idea what tools. Was hoping to get suggestions from the community (going to the next London meetup). Rubenwolff (talk)
  • In general I'm glad somebody's working on something like this, but a test needs to be run and reviewed. The test sample should include a range of companies, some of which have entries already in Wikidata, and some of which don't. It should demonstrate how you plan to handle parent/subsidiary/business division relations, etc. ArthurPSmith (talk) 14:49, 20 March 2019 (UTC)
  • @Rubenwolff: A few questions about the dataset itself:
  • Is there, in the dataset, a persistent identifier for each company? Could that identifier be introduced as a Wikidata external-id property and used for Mix-n-match with existing companies?
  • Is there an intellectual curation process within your database to eliminate duplicates from crawling (e.g., one company with multiple homepages/domain names)?
  • What is the license for the dataset? (Could you perhaps link to an according web page?)

-- Jneubert (talk) 15:25, 20 March 2019 (UTC)

      • My data is oriented around domain names. If a company changes their domain name they will 302 it to their new domain. Additionally if a smaller company gets acquired by a larger one they also 302 to the new owner. In the case of any redirect the destination is considered as new truth and the old company/domain are deleted. (I am working on storing this relationship graph explicitly )
      • I am running a trial asking companies to review their profile but with the current response rate I don't think it would scale. So probably not by humans. But we can create requirements that any piece of information that is not from a company homepage must have at least 2 citations for example.
      • undecided about the license, I am considering GPL or MIT
  • @Rubenwolff: Great project.
    • In the case of UK companies, would your bot populate Companies House ID (P2622) in addition to the ID in the format "gb/10775593" you mentioned in your example? - PKM (talk) 20:07, 20 March 2019 (UTC)
      • Oh cool you already have this Property yea I'll put it on the task list. Rubenwolff (talk)
  • I looked at an example of a large corporate in Germany: BMW (Q26678). The data in the corepo about it seems weak: a) name is not BMW Motorcycles, b) more than 10.000 employees seems rough, it is stated to be more than 120.000. I am skeptical to import such data into Wikidata. --Zuphilip (talk) 18:45, 24 March 2019 (UTC)
      • We have categories because employee count fluctuates and for non public companies it is never a precise number. It is common to stop at 10000+ because there are very few companies that have > 100000. So yea this should be a ENUM not an INT. Rubenwolff (talk)
  • How would updates look like? Supposedly you would add a new statement with a different "point in time" qualifier? You wrote "the destination is considered as new truth and the old company/domain are deleted": how would this look in Wikidata as we don't delete the "old truth". --- Jura 14:26, 4 July 2019 (UTC)
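
The employee-count categories used in this request (1-50, 51-200 and so on, with an open-ended top bucket) can be treated as an enumeration rather than an integer, as suggested in the thread. This is a small Python sketch under that assumption; the function and constant names are illustrative.

```python
import bisect

# Upper bounds of the ranges quoted in this request; the last bucket is open-ended.
UPPER_BOUNDS = [50, 200, 500, 1000, 5000, 10000]
LABELS = [
    "1-50",
    "51-200",
    "201-500",
    "501-1000",
    "1001-5000",
    "5001-10000",
    ">=10001",
]


def employee_range(count: int) -> str:
    """Map an exact head count to the coarse range label used by the dataset."""
    if count < 1:
        raise ValueError("employee count must be positive")
    # bisect_left finds the first upper bound that is >= count.
    return LABELS[bisect.bisect_left(UPPER_BOUNDS, count)]


print(employee_range(120000))  # >=10001
```

A company with more than 120,000 employees therefore lands in the same bucket as one with 10,001, which is the loss of precision questioned in the BMW comment.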


Niko.georgievbot
Operator: Niko.georgiev

A bot to edit the Academia Europaea members on Wikidata

Task/s: 1. Add missing qualifiers to members (some members only have a start date, others only an affiliation). 2. Fix sources; most of the sources point to the wrong URL (it still works, but it's still wrong).


Function details:

Edits items based on the OpenRefine file which was used to add all of the info to Wikidata. Edits only "member of" -> Academia Europaea claims and only adds qualifiers if they are missing. In the future it will also remove duplicate qualifiers, since those can also be found.
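
The "only add qualifiers if they are missing" rule can be sketched as below. This is Python over a simplified claim dictionary, not the Wikibase JSON format and not the operator's OpenRefine workflow. `Q_ACADEMIA_EUROPAEA` is a placeholder rather than a real item ID, and the qualifier properties chosen (start time P580, affiliation P1416) are assumptions for the example.

```python
from typing import Dict

MEMBER_OF = "P463"
ACADEMY = "Q_ACADEMIA_EUROPAEA"  # placeholder, not a real item ID


def missing_qualifiers(claim: Dict, wanted: Dict[str, str]) -> Dict[str, str]:
    """Return the qualifiers from `wanted` that the claim lacks; never overwrite."""
    # Leave every claim alone except "member of" -> Academia Europaea.
    if claim.get("property") != MEMBER_OF or claim.get("value") != ACADEMY:
        return {}
    present = claim.get("qualifiers", {})
    return {prop: value for prop, value in wanted.items() if prop not in present}


# Hypothetical claim that already has a start time but no affiliation.
claim = {"property": MEMBER_OF, "value": ACADEMY, "qualifiers": {"P580": "2010"}}
wanted = {"P580": "2011", "P1416": "Q_SOME_UNIVERSITY"}
to_add = missing_qualifiers(claim, wanted)
```

Here only the affiliation would be added; the existing start time is kept even though the file holds a different value, so conflicts are left for manual review.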

--Niko.georgievbot (talk) 08:15, 10 May 2019 (UTC)

@Niko.georgiev: If you are still interested in this task, please let the bot make some test edits. Lymantria (talk) 05:29, 10 September 2019 (UTC)


PodoBot
Operator: Lettalk

Task/s: Ability to get properties of a certain entity item from the Wikidata page.


Function details: Get the properties of the items that are of interest to this project. --Lettalk (talk) 01:57, 25 March 2019 (UTC)

@Lettalk: Do I understand correctly that you're only planning on retrieving data and not modifying data? Mbch331 (talk) 20:24, 21 September 2019 (UTC)


Souedbot
Operator: Soued031


Do changes about Luxembourg or entities that contain Luxembourgish.


Function details: --Souedbot (talk) 12:35, 11 June 2019 (UTC)

@Soued031:, if you are still interested in the task, please make some test edits.--Ymblanter (talk) 18:54, 10 July 2019 (UTC)

EpiskoBot 2

EpiskoBot
Operator: Looperz



Function details:

--Looperz (talk) 18:55, 26 June 2019 (UTC)

  • Yeah, please do that. I fixed it on Q450675. --- Jura 03:59, 29 June 2019 (UTC)
  • For the moment I am waiting for a decision, because this question affects about 34.000 Pages. With often not only one claim for consecrator (P1598). --Looperz (talk) 10:14, 30 June 2019 (UTC)
    • I think you should stop misusing the qualifier. You probably added some 8000 since the problem was mentioned to you. If you think your approach is correct, you might want to seek additional feedback on project chat. The bot approval is more technical in nature. Personally, I tend to oppose additional requests by users who are known not to cleanup their bot tasks. --- Jura 10:18, 30 June 2019 (UTC)
Thank you for your opinion, Jura. Where should I go and get other opinions? Since it is "just" one single opinion, this is no reason for a full edit stop or change. --Looperz (talk) 23:56, 30 June 2019 (UTC)
I know it's just your single opinion, but even so I don't think you bring much to support it either. Project chat is at Wikidata:Project chat. --- Jura 00:28, 1 July 2019 (UTC)
Your proposal for using object has role (P3831) is just a single opinion, too. subject has role (P2868) is at least an auto-suggested qualifier. The sentence "I tend to oppose additional requests by users who are known not to cleanup their bot tasks" is an accusation I have to contradict. I offered a change, even by mass edit, as soon as I get a common decision for that subject-object confusion. --Looperz (talk) 03:40, 1 July 2019 (UTC)
Well, I noticed you ignored Ahoerstemeier's opinion and autosuggestion might just come from your bad edits. --- Jura 08:09, 1 July 2019 (UTC)

Meanwhile object has role (P3831) has the majority and I am really looking forward to a change of that autosuggestion thing :-) --Looperz (talk) 13:10, 12 July 2019 (UTC)


EbeBot (talk • contribs • new items • SUL • Block log • User rights log • User rights • xtools)
Operator: Ebe123 (talk • contribs • logs)

Function details

Takes election data from Elections Canada and the Library of Parliament that I process in Excel to create new items or add candidacy in election (P3602) statements. Could the flooder flag be provided for the 3rd point? Ebe123 (talk | contributions) 20:39, 6 April 2019 (UTC)

  • I don't think P4100 should be used for the list a candidate is included on. Parliamentary groups would be the ones that are formed once a candidate is actually elected. Most of the time, this would match the list, but it needn't. Neither is the parliamentary group necessarily identical to the political party. --- Jura 04:10, 7 April 2019 (UTC)
    I don't understand your point about lists nor about parliamentary groups, but most candidates run as a member of a party (parliamentary group). What is recorded is the party they run for, and that is identical to the political party... Ebe123 (talk | contributions) 12:29, 7 April 2019 (UTC)
    Most, but you are amalgamating three different things. Is the Canadian term for "parliamentary group" "caucus"? --- Jura 13:00, 7 April 2019 (UTC)
    Yes. Ebe123 (talk | contributions) 13:15, 7 April 2019 (UTC)
    News about the Trudeau/Wilson-Raybould/Philpott story are sometimes a bit sketchy, but isn't the expulsion from the caucus, not the party? --- Jura 14:42, 7 April 2019 (UTC)
    They ran as liberals, which would be the value used for P4100, but they can't run as Liberals next election. Ebe123 (talk | contributions) 20:46, 7 April 2019 (UTC)
    @Jura1:, are you satisfied with the answers?--Ymblanter (talk) 19:04, 10 April 2019 (UTC)
    I still think the planned property use is incorrectly combining different aspects. To me, it seems clear from the Canadian sample. In countries where there are more than 2 or 3 parties, parliamentary groups generally combine several parties, but members still get elected on a list for a specific party. --- Jura 07:14, 13 April 2019 (UTC)
    You're talking about coalitions, which are formed after the election and use political coalition (P5832). What property do you think better represents the party for which a candidate/nominee runs? Ebe123 (talk | contributions) 13:45, 13 April 2019 (UTC)
  •   Support -SixTwoEight (talk) 00:46, 10 April 2019 (UTC)
  • @Rhadamante‎, Serpicozaure: who seem to be working on parliamentary groups. --- Jura 08:25, 13 April 2019 (UTC)
    I agree with Jura, P4100 is not the thing to use here. Parliamentary groups, parties, coalitions and electoral lists are different things, even in countries with a Westminster system.
    • Parliamentary groups are exclusively for elected people sitting together in a parliamentary assembly. Most of the time they are from the same party, but sometimes not: one can have been elected as an independent, or from an allied party that does not have the minimum required number of elected members to form its own group (I think what's going on in the local parliament in Québec illustrates the concept pretty well).
    • Coalitions can take different forms. It can be an electoral coalition of parties making a common electoral list together, while still being separate; e.g. in Spain, that was the case for Unidos Podemos (Q24039754), in Greece for SYRIZA (Q222897) (before transforming themselves into a party); in France, for the upcoming European election, there is even a coalition Socialist Party (Q170972)/Place publique (Q58366009)/New Deal (Q15629523) saying that their future deputies will decide individually in which parliamentary group they will sit... It can also be an alliance afterwards, political parties elected separately, sometimes with the clear goal to govern together in the end (that was at some point the case with the libdem/Cons in the UK or CDU/FDP in Germany in the noughties), sometimes not (for instance the current coalition Conservative Party (Q9626)/Conservative Party (Q9626) in the UK, or Five Star Movement (Q47817)/Lega Nord (Q47750) in Italy, not to mention Greece, Spain, Portugal, Germany, the Netherlands, Austria, Belgium...)
    But back to the subject: no, P4100 cannot be used for electoral lists. Either use the parties, coalitions, custom items, or even nothing, but not P4100. Rhadamante (talk) 20:24, 13 April 2019 (UTC)
    I don't quite understand your point; here is an example of how I (would) use the property, on Justin Trudeau (Q3099714):
    candidacy in election: 2015 Canadian federal election
      votes received: 26,391
      electoral district: Papineau
      parliamentary group: Liberal Party of Canada
      (0 references)
    Canada does not have an electoral list system (unless one counts one candidate per party per riding as a list). How would you represent (with which property) the relationship between party and candidate, together with a riding, for a candidacy in an election? Ebe123 (talk | contributions) 00:27, 14 April 2019 (UTC)
    To unpack your point about Quebec (though I am not very familiar with it): if you mean the 2018 Quebec general election (Q17001196), the Coalition Avenir Québec (Q2348226) is a party, not a coalition of several (and it is provincial, not local). Also, the minimum number of elected members for a party to be considered "official" is a federal rule (12, so the Parti Québécois (Q950356) would not qualify in this parliament), but that is not very relevant since I use the parties registered with the electoral authority, which has no success criteria. It would be revisionism to change the details of a candidacy because of how the parliament was formed afterwards.
    I had changed the property from represents (P1268) to parliamentary group (P4100) because I thought it would cause less confusion. Do you think P1268 is more appropriate? Ebe123 (talk | contributions) 01:03, 14 April 2019 (UTC)
    I never mentioned the CAQ. My remarks on coalitions were prompted only by your misuse of the term in the discussion above. As for Quebec politics, I was referring, among other things, to Catherine Fournier leaving the PQ, which would normally have led to the disappearance of its parliamentary group. But never mind. The example given above is a perfect example of a misreading: the Liberal Party of Canada (Q138345) is not a parliamentary group. Like any other party, it should never be used as a value for P4100. And in any case, as I said above, it makes no sense to use P4100 to record the affiliation of a candidate in an election to their party, or at least to the party that nominated them. P4100 records which parliamentary group an elected member sits in during the legislature to which they were elected, which can moreover change, cf. the case of Catherine Fournier. And by that logic, how would you qualify Jean-Martin Aussant or the two QS members at the 2012 election, since there was never an ON or QS parliamentary group in the legislature that followed?
    represents (P1268) does not seem adequate to me either. An elected member represents the population of the territory where they are elected (or the "territory", whatever that means, in the case of senators, for example in France or the United States), not their party. Why not simply use member of political party (P102)?
    Rhadamante (talk) 04:50, 14 April 2019 (UTC)
    member of political party (P102) works for me; it just wasn't allowed with candidacy in election (P3602) (I will add it). Apart from that, is there any situation where parliamentary group (P4100) is compatible with candidacy in election (P3602)? Ebe123 (talk | contributions) 18:55, 14 April 2019 (UTC)
    I doubt it. It would be good to have a summary of the variants and the applicable local terminology. Perhaps the IPU can help us. --- Jura 19:05, 14 April 2019 (UTC)
    Absolutely never. A parliamentary group is a matter of a parliament's internal organisation, and so concerns people who are already elected. It has nothing to do with the election. Rhadamante (talk) 22:03, 14 April 2019 (UTC)
    That's what I thought. I have added the constraint against using this property.
  • I have restored the third point, for the transfer of qualifiers (??). Are you satisfied with my bot request? Ebe123 (talk | contributions) 00:32, 20 April 2019 (UTC)
    @Jura1:@Rhadamante:--Ymblanter (talk) 18:13, 5 May 2019 (UTC)
    • How about a couple of test edits? --- Jura 18:32, 5 May 2019 (UTC)
  • I don't think that the national election should be the target of candidacy in election (P3602). Each candidate runs in an election in their own riding. Each riding's election should have its own item, and those items should be the targets of candidacy in election (P3602). --Yair rand (talk) 23:59, 8 May 2019 (UTC)
    Each riding only forms a part of the full election, and so electoral district (P768) represents well what part of the election has been contested by the person. Ebe123 (talk | contributions) 03:40, 13 May 2019 (UTC)
    Each riding's election has a distinct electorate (P1831), a number of ballots cast (P1868), a set of candidates, a successful candidate (P991), a number of valid votes, and so on. I think it's quite clear that to store the relevant information, each election which is part of the broader election requires its own item, which should be the target of candidature statements. --Yair rand (talk) 04:55, 13 May 2019 (UTC)


NMBot (talk • contribs • new items • SUL • Block log • User rights log • User rights • xtools)
Operator: Notme1560 (talk • contribs • logs)

Task/s: Remove the brackets indicating translation from the English labels and titles of scholarly articles

Code: GitLab repository has scripts and queries to be copied into pywikibot installation.

Function details:

Requested December 2018.

Selects scholarly article (Q13442814) with a PubMed ID (P698) claim and an English label starting with [ and ending with ]. The English label is converted from [XXX]. to XXX.

If no title claim exists, it currently exits (uses the content of the old title to build the new title) but this can be refactored later. If multiple title claims exist, it also exits (doesn't handle deprecating multiple previous claims) but this can also be refactored later. Otherwise, the existing title claim is set to deprecated and a new claim with the correct format is added. --Notme1560 (talk) 20:33, 23 March 2019 (UTC)
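The label rule and the claim-count checks described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code from the GitLab repository: the function names `strip_translation_brackets` and `plan_title_update` and the shape of the returned dictionary are hypothetical, no pywikibot calls are shown, and it is assumed that the new title is derived from the old title with the same bracket rule.

```python
from typing import List, Optional


def strip_translation_brackets(text: str) -> Optional[str]:
    """Turn "[XXX]." into "XXX." (or "[XXX]" into "XXX").

    Returns None when the text is not wrapped in brackets.
    """
    text = text.strip()
    trailing = ""
    if text.endswith("]."):
        # Keep the final period that follows the closing bracket.
        text, trailing = text[:-1], "."
    if len(text) > 2 and text.startswith("[") and text.endswith("]"):
        return text[1:-1].strip() + trailing
    return None


def plan_title_update(label: str, title_claims: List[str]) -> Optional[dict]:
    """Return the planned change, or None when the item would be skipped."""
    new_label = strip_translation_brackets(label)
    if new_label is None:
        return None  # label is not bracketed: nothing to do
    if len(title_claims) != 1:
        return None  # no title claim, or several: exit as described above
    old_title = title_claims[0]
    # Assumption: the new title is built from the old title's content.
    new_title = strip_translation_brackets(old_title)
    return {
        "new_label": new_label,
        "deprecate_title": old_title if new_title else None,
        "add_title": new_title,
    }
```

Returning a plan rather than editing directly keeps the skip conditions (no title claim, several title claims) easy to review before any edit is made.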

How do you want to reflect that this a) a translation and b) of which language? What about title (P1476)? --Succu (talk) 20:52, 23 March 2019 (UTC)
All these articles are from PubMed, but I'm not sure who imported them and when they were imported. The titles with brackets in PubMed are supposed to indicate that the displayed title has been translated to English, but this bot doesn't have access to the untranslated/original titles since it doesn't pull from PubMed's API (only editing existing items). The original language/title information should be on the PubMed page reachable through the PubMed ID (P698) claim, which links to the site (e.g. "[Article in Portuguese]" is shown on the site), so it can be looked up there. Other than that, I'm not sure. --Notme1560 (talk) 21:11, 23 March 2019 (UTC)
I know. Hence   Oppose --Succu (talk) 21:58, 23 March 2019 (UTC)
The original title and language can be retrieved from the XML version. In this case <Language>por</Language> <VernacularTitle>Estudo caso-controle com resposta multinomial: uma proposta de análise.</VernacularTitle>. Emijrp (talk) 09:22, 24 March 2019 (UTC)
Thanks Emijrp, I guess I will have to integrate the PubMed API now and I guess I can pull other missing article data now as well. I guess this can be closed and I can create a new request with the new tasks and details later. --Notme1560 (talk) 19:09, 24 March 2019 (UTC) (sig added hours later, forgot to sign)
I don't see anything wrong with fixing these English labels that are clearly wrong. There's no assertion anywhere that the label is the actual original title of the paper, we have other properties to state that sort of thing. That said, it would be nice to get title in the original language as well. It would also be nice if somebody could fix the rather substantial number of these which have been added with NO label in any language! I'm not sure how they even did that... ArthurPSmith (talk) 17:36, 25 March 2019 (UTC)
Sometimes CrossRef provides no title information... Is enWP preferring translated titles as labels? My question above remains unanswered. ([Case-control studies with multinomial responses: a proposal for analysis]. (Q27687073)). --Succu (talk) 20:21, 25 March 2019 (UTC)
  • I don't think it's Crossref that's the problem - here are examples with only a Pubmed ID: Q58595485 and Q61049189. SourceMD must be doing some over-filtering and then somehow creating items with no label at all!? ArthurPSmith (talk) 12:17, 26 March 2019 (UTC)
  • Somehow I missed this request. Thanks for doing it! --- Jura 19:43, 12 May 2019 (UTC)
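As a rough illustration of Emijrp's point above, the `Language` and `VernacularTitle` elements can be read with Python's standard `xml.etree.ElementTree`. This is a sketch only: the enclosing `<Article>` wrapper in the sample and the function name `original_title` are assumptions made for the example, and fetching the record from PubMed is not shown.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Tuple

# Fragment built from the two elements quoted in the discussion;
# the <Article> wrapper is an assumption for this example.
SAMPLE = """<Article>
  <Language>por</Language>
  <VernacularTitle>Estudo caso-controle com resposta multinomial: uma proposta de análise.</VernacularTitle>
</Article>"""


def original_title(xml_text: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (language code, vernacular title); either may be None."""
    root = ET.fromstring(xml_text)
    # ".//" searches all descendants, so the exact nesting does not matter.
    language = root.findtext(".//Language")
    title = root.findtext(".//VernacularTitle")
    return language, title
```

With both values available, the bracketed English text could stay a label while the vernacular title is recorded with its language, which is what Succu's question about title (P1476) asks for.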

GZWDer (flood) 5

GZWDer (flood) (talk • contribs • new items • SUL • Block log • User rights log • User rights • xtools)
Operator: GZWDer (talk • contribs • logs)

Task/s: Mass import of lexemes from various (reliable) sources.

Code: Not available for now

Function details: For a source in a specific language:

  1. Export the word list (with part of speech, and forms if possible)
  2. Use SPARQL to find any existing lexemes that may be duplicates. Remove them.
  3. Create lexemes for non-existent ones.
  4. If the source is in public domain, also add senses. (For existing lexemes, senses may be created if no senses exists. Lexemes with existing senses will be skipped.) If the source is copyrighted, only the words will be imported.

--GZWDer (talk) 17:29, 19 January 2019 (UTC)
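Steps 2 and 3 of the workflow above amount to filtering the exported word list against the lexemes that already exist. A minimal sketch, assuming the SPARQL results have already been loaded into a list; the `Entry` type, the choice of (lemma, language, lexical category) as the duplicate key and the function name are all hypothetical, and no code has been published for this request.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Entry(NamedTuple):
    lemma: str
    language: str   # hypothetical: a language code or item ID
    category: str   # hypothetical: lexical category, e.g. "verb"


def split_new_and_duplicates(
    candidates: Iterable[Entry], existing: Iterable[Entry]
) -> Tuple[List[Entry], List[Entry]]:
    """Separate candidates into (to create, possible duplicates)."""
    known = set(existing)
    to_create: List[Entry] = []
    duplicates: List[Entry] = []
    for entry in candidates:
        if entry in known:
            duplicates.append(entry)
        else:
            to_create.append(entry)
            known.add(entry)  # also catches repeats inside the source itself
    return to_create, duplicates
```

The comparison here is exact, so capitalisation and spelling variants are treated as different lemmas; whether that is the right rule is one of the details each source would need to settle.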


  •   Oppose I think every source should be discussed separately. KaMan (talk) 17:42, 19 January 2019 (UTC)
  •   Comment A. Are we allowed to mass-import lexemes? I didn't think that was permitted yet. B. Wikidata's Lexeme structure is different from what most sources would have - I think we REALLY need to see a good set of sample edits (and perhaps the source code too) before starting on this. Definitely do not allow this without samples of the bot's work. And each separate source should be requested as a separate task (and samples provided before approval). ArthurPSmith (talk) 19:41, 19 January 2019 (UTC)
    • As currently we don't have consensus for mass-import lexemes, I'm filing a request to obtain one. Also it is easy not to import duplicates as long as we check the existing list of lemmas.--GZWDer (talk) 20:14, 19 January 2019 (UTC)
      • Ok, specify an example source and let's see at least 10 proposed examples either implemented or with enough detail that we can tell what you are doing. Lexemes are more than just words so just importing a word list is NOT what we want here. How do you determine lexical category? How do you generate forms? How do you check for alternate representations (spelling variants)? These are important details! ArthurPSmith (talk) 16:11, 1 February 2019 (UTC)
      • One example of a potential source that has data structured in a reasonably similar fashion might be WordNet (for English). The words it includes could be considered lemmas for lexemes, as they deliberately remove all inflected forms. However that means it could not itself be a source for those forms. It groups words into synonym sets so that senses could be generated I think somewhat automatically from them. So this would be an interesting collection to pursue via automation. But there are still a lot of details that would need to be examined to make sure we were doing something sensible with the automated import. ArthurPSmith (talk) 14:48, 5 February 2019 (UTC)
  •   Oppose per Arthur in full. Mahir256 (talk) 21:53, 4 February 2019 (UTC)
  • I have created some example lexemes. Comments welcome, but more significant work will not start until July. In the future probably millions of lexemes will be imported (as many sources are copyrighted, we may expect lexemes without senses, as I will not import them).--GZWDer (talk) 23:54, 14 February 2019 (UTC)
    • @GZWDer: Thanks! So what you've done here looks reasonable to me. I'm certainly not familiar with Welsh, but it looks like you are importing Welsh verb infinitives as lexemes from an out-of-copyright Welsh-English dictionary, adding the English senses as sense S1 on each one. No forms, or secondary senses. So language and lexical category are clear, and it appears this source wouldn't have multiple forms for the same lexeme in different places, so we shouldn't need to worry about duplication there. The one thing that might be nice to add would be at least one form (presumably this dictionary is using a standard form for the verbs?). Also if you are planning to add other lexemes from this source it would be nice to see how you would handle other lexical categories - is "Adar, n. p. birds, fowls" the plural form of "Adain, n. a wing; a bird" or are they different lexemes? It also looks like this source doesn't include any proper nouns so we don't have to worry about capitalization issues. Anyway, in general I'd   Support this particular case, but I think there's still a bit more to work out with it. ArthurPSmith (talk) 15:28, 15 February 2019 (UTC)
      • Further comment here - I'm not sure if you've imported Lexeme:L42622 correctly - the source seems to use a semicolon character (';') to indicate separate senses, so I think that should be 2 senses, not 1, in this case. See the next entry - 'Absenwr, n. m. backbiter; absentee; slanderer' where 'absentee' is clearly a distinct meaning. Many more examples further down, such as 'Ach, n. f. a fluid liquid; a stem' which are even clearer on this. Also it would be nice if the source could be directly referenced as the source of the gloss on the sense. Not sure we have a mechanism to do that right now. ArthurPSmith (talk) 15:47, 15 February 2019 (UTC)
      • The workflow of import may be improved (senses will be split); for now, the 100 entries I have imported may be manually fixed. --GZWDer (talk) 12:26, 16 February 2019 (UTC)
        So do you plan to correct them manually? KaMan (talk) 12:43, 16 February 2019 (UTC)
        • Corrected.--GZWDer (talk) 13:06, 16 February 2019 (UTC)
    • I think it's time to move ahead with bot created items. I'm not really convinced the project has progressed much in recent months as far as lexemes concerned. The above can give a much needed fresh productive contribution. @Llywelyn2000: what do you think of the newly created Welsh language lexemes? --- Jura 16:41, 15 February 2019 (UTC)
  • Hesitating between   Support and   Wait.   Comment interesting but the examples like adgyffroi (L42717) need some work to reach what I think is the minimal level. Lexemes should always have at least one form (the main lemma); described by source (P1343) is very good, but could we have the page(s) (P304) too? Maybe it would be even better to have several values in described by source (P1343), which would tackle all the copyright and reliability problems. PS: a native speaker review is a sine qua non (especially as some languages have changed a lot; if you were to import the Lexique étymologique du breton moderne (Q19216625) - I'm working on it on Wikisource right now with the plan to import it into Lexemes one day ;) - a lot of lemmas would need to be rectified before import). Cdlt, VIGNERON (talk) 16:43, 15 February 2019 (UTC)
  • Note I also plan to import copyrighted sources - but only the words themselves, not any definitions, so we will have many lexemes without senses. Anyway any further action will be after July. By the way, many sources I found do not have part of speech information, so we may want to set up something like mix'n'match to handle them (this is also useful for online resources like Wiktionary where entries are not fully reliable).--GZWDer (talk) 12:26, 16 February 2019 (UTC)
    That's why I think every import source should be discussed separately, not in one request for permission. KaMan (talk) 12:43, 16 February 2019 (UTC)
  •   Oppose I agree with KaMan, each import source should be discussed separately. Please open new requests for permissions for each source. Pamputt (talk) 10:22, 17 February 2019 (UTC)
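The sense-splitting issue raised above (a semicolon separates senses in the Welsh-English dictionary) can be illustrated with a small parser. This is a sketch based only on the entries quoted in this discussion: the assumption that an entry is "lemma, abbreviations gloss", the abbreviation pattern and the names `ParsedEntry` and `parse_entry` are hypothetical, and the real source may have cases this does not handle.

```python
import re
from typing import List, NamedTuple

# Assumption: grammatical markers are short lowercase abbreviations
# such as "n.", "m.", "f." or "p." placed before the gloss.
ABBREVIATION = re.compile(r"^[a-z]{1,3}\.$")


class ParsedEntry(NamedTuple):
    lemma: str
    markers: List[str]
    senses: List[str]


def parse_entry(line: str) -> ParsedEntry:
    """Split a dictionary line into lemma, markers and separate senses."""
    lemma, _, rest = line.partition(",")
    tokens = rest.split()
    markers: List[str] = []
    while tokens and ABBREVIATION.match(tokens[0]):
        markers.append(tokens.pop(0))
    gloss = " ".join(tokens)
    # A semicolon starts a new sense; commas stay inside one sense.
    senses = [part.strip() for part in gloss.split(";") if part.strip()]
    return ParsedEntry(lemma.strip(), markers, senses)
```

For the entry 'Absenwr, n. m. backbiter; absentee; slanderer' this yields three senses, while 'Adar, n. p. birds, fowls' stays a single sense because commas are not treated as separators.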


SmhiSwbBot (talk • contribs • new items • SUL • Block log • User rights log • User rights • xtools)
Operator: SmhiSwbBot (talk • contribs • logs)

Task/s: SMHI (Swedish Meteorological and Hydrological Institute) wants to upload surface water body information to Wikidata. Apart from the upload, the data needs to be updated regularly as well.

Code: We reappropriate the following code:

Function details: --SmhiSwbBot (talk) 09:01, 19 December 2018 (UTC)

@SmhiSwbBot:, please make some test edits. I assume the database you are planning to upload is licensed as CC-0.--Ymblanter (talk) 20:47, 20 January 2019 (UTC)


DBDataPublisherBot (talk • contribs • new items • SUL • Block log • User rights log • User rights • xtools)
Operator: GrafLevenshtein (talk • contribs • logs)

Task/s: This bot is meant as a basis to import data officially provided by the Deutsche Bahn as part of the open data initiative on [5].

Code: github

Function details: For now the bot only uses the StaDa API to access the official station data. From there it exports:

  1. the official name as an additional alias if it is not already the label or an alias in German
  2. the given postal address if it is not already in Wikidata
  3. the federal state as a location if there is no location specified in Wikidata
  4. the station category if not already set
  5. the geo coordinates if not already set or if the value in the DB dataset is more precise
  6. any station code that is not already set

It sets "" as the URL source for changes 2-6.
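The six "only if not already set" rules above can be written as one comparison between the existing item and the official record. This is a sketch under assumptions and not taken from the bot's GitHub code: the dictionary field names, the function name `plan_station_edits` and the convention that coordinate precision is a number in degrees (smaller meaning more precise) are all hypothetical.

```python
from typing import Any, Dict


def plan_station_edits(existing: Dict[str, Any], official: Dict[str, Any]) -> Dict[str, Any]:
    """Return only the values that would be added or replaced."""
    edits: Dict[str, Any] = {}

    # 1. Official name becomes an alias unless it is already label or alias.
    name = official.get("name")
    known_names = {existing.get("label_de")} | set(existing.get("aliases_de", []))
    if name and name not in known_names:
        edits["alias_de"] = name

    # 2-4. Simple values are added only when nothing is set yet.
    for field in ("postal_address", "location", "category"):
        if official.get(field) and not existing.get(field):
            edits[field] = official[field]

    # 5. Coordinates: add if missing, replace only if more precise.
    new_c, old_c = official.get("coordinates"), existing.get("coordinates")
    if new_c and (old_c is None or new_c["precision"] < old_c["precision"]):
        edits["coordinates"] = new_c

    # 6. Station codes that are not recorded yet.
    missing = sorted(set(official.get("station_codes", [])) - set(existing.get("station_codes", [])))
    if missing:
        edits["station_codes"] = missing
    return edits
```

Because the function only reports differences, an item that already carries all the data produces an empty plan and no edit.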

--GrafLevenshtein (talk) 12:53, 8 December 2018 (UTC)

What license is the StaDa data available under? - Nikki (talk) 21:01, 8 December 2018 (UTC)
Creative Commons Attribution 4.0 International (CC BY 4.0) --GrafLevenshtein (talk) 12:44, 10 December 2018 (UTC)
I don't think that would be compatible with Wikidata's CC0 license. :( - Nikki (talk) 15:20, 11 December 2018 (UTC)
@GrafLevenshtein: I agree with the finding, that the licence is incompatible. You may try to seek permission to publish data here under Wikidata's CC0 licence, otherwise we cannot proceed with this request. --Vogone (talk) 22:17, 2 March 2019 (UTC)


LauBot (talk • contribs • new items • SUL • Block log • User rights log • User rights • xtools)
Operator: Laurentius (talk • contribs • logs)

Task/s: stuff related to aligning Italian Wikipedia and Wikidata data.


Function details:

For quite a while, I've done some work on getting data from Italian Wikipedia to Wikidata (which includes adapting local templates, importing data, cleaning data). For most of that a local bot plus tools like harvest_templates and QuickStatements are fine (since one can just write a bot that creates QuickStatements instructions, although it feels a bit silly), but in some cases it's pretty limiting. I'm asking for a bot flag on LauBot (which has been a flagged bot on Italian Wikipedia since 2014) and authorization to do the same things in a more flexible way.

I will make some test edits shortly. --Laurentius (talk) 14:08, 1 December 2018 (UTC)

Please be aware of Wikidata:Bots, which among other things requires that bots are authorized task by task. So could you specify the bot task you wish to perform? Lymantria (talk) 07:48, 16 December 2018 (UTC)


JonHaraldSøbyWMNO-bot (talk • contribs • new items • SUL • Block log • User rights log • User rights • xtools)
Operator: Jon Harald Søby (WMNO) (talk • contribs • logs)

Task/s: Add items (and keep them up-to-date) from the Sami bibliography from the National Library of Norway to Wikidata.

Code: Not published yet (but will be eventually, see phab:T205631 et al)

Function details: As part of Wikimedia Norge's Northern Sami project, we have prepared an import of the data from the Sami bibliography to Wikidata. The Sami bibliography is a listing of all works published in Sami languages or about Sami people/culture in Norway. It contains around 26,000 work editions with plenty of metadata, and the items will be structured according to the standards laid out in Wikidata:WikiProject Books. I am also planning to write a script to keep the data up-to-date, but the first priority is doing the import. --Jon Harald Søby (WMNO) (talk) 12:59, 25 October 2018 (UTC)

  • interesting. Please do some tests once ready. --- Jura 13:54, 25 October 2018 (UTC)
  • Please do some 100 test edits. Lymantria (talk) 12:03, 2 December 2018 (UTC)
Hi Jura1 and Lymantria, I'd like to resurrect this request. Other things got in the way, but I'm all good to go now. I've ended up settling for QuickStatements to actually do the edits, but when I try to do the test edits it says that I can't because the bot account isn't autoconfirmed. Could you possibly put it into the "confirmed users" group so I can do the test edits? Thanks. Jon Harald Søby (WMNO) (talk) 22:46, 17 October 2019 (UTC)


MewBot (talk • contribs • new items • SUL • Block log • User rights log • User rights • xtools)
Operator: Rua (talk • contribs • logs)

Task/s: Importing lexemes from en.Wiktionary in specific languages


Function details: The bot will be used to parse entries from English Wiktionary using pywikibot and mwparserfromhell, and then either create lexemes on Wikidata, or add information to existing lexemes. Care is taken to not duplicate information: the script checks if the lexeme exists and already has the desired properties and only adds anything if not. In case of doubt (e.g. multiple matching lexemes already exist) it skips the edit. I made some test edits using my own user account, they can be seen from [6] to [7]. Today I did a few on the MewBot account.
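The duplicate-avoidance rule described above (create when nothing matches, add only missing properties when exactly one lexeme matches, skip when several match) can be sketched as a small decision function. This is an illustration only, not MewBot's actual script: the function name `decide_action`, the action labels and the dictionary layout are hypothetical, and the pywikibot and mwparserfromhell steps are left out.

```python
from typing import Dict, List, Tuple


def decide_action(matches: List[dict], wanted: Dict[str, str]) -> Tuple[str, Dict[str, str]]:
    """Return (action, properties to write) for one Wiktionary entry.

    matches: existing lexemes that fit the entry, each with a
             "properties" mapping of property ID to value.
    wanted:  properties the import would like the lexeme to have.
    """
    if len(matches) > 1:
        return ("skip", {})  # in case of doubt, do not edit
    if not matches:
        return ("create", dict(wanted))
    present = matches[0].get("properties", {})
    missing = {pid: value for pid, value in wanted.items() if pid not in present}
    if missing:
        return ("update", missing)
    return ("nothing", {})
```

Keeping the decision separate from the editing code makes it straightforward to log every skipped entry for later manual review.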

Individual imports will be proposed with the lexicographical data project first, as it has been said by the project leaders to be careful with imports at first. The current proposal is for Proto-Samic and Proto-Uralic lexemes, seen at Wikidata talk:Lexicographical data#Requesting permission for bot import: Proto-Uralic and Proto-Samic lexemes. Once the project leaders give the ok for all imports, permission will no longer be needed for individual imports. Planned future imports are for Dutch and the modern Sami languages. --—Rua (mew) 09:37, 22 September 2018 (UTC)

I am ready to approve this request in a couple of days, provided that no objections will be raised meanwhile. Lymantria (talk) 05:27, 25 September 2018 (UTC)
I just noticed that Wikidata:Bots says I need to indicate where the bot copied the data from. How do I indicate that the data came from Wiktionary? —Rua (mew) 10:51, 25 September 2018 (UTC)
Could you run your bot on few entries in order to evaluate it? Thanks in advance. Pamputt (talk) 10:59, 26 September 2018 (UTC)
I did, already. Do I need to do more? —Rua (mew) 11:02, 26 September 2018 (UTC)
  Oppose Ah sorry I did not check before asking. For all reconstructed forms, I think a reference is mandatory. As these "words" do not exist, they come from specialists' work and have to be sourced. Two linguists may reconstruct different forms. That said, I am not sure about the copyright issue for reconstructed forms. They probably belong to the public domain as scientific work, but it would be better to be sure. Pamputt (talk) 21:42, 26 September 2018 (UTC)
Not all reconstructions on Wiktionary can be sourced to some external work. Some were reconstructed by Wiktionary editors. This is because not all reconstructed forms are available in external works, and we have to fill the gaps ourselves. The bot adds links to Álgu and Uralonet if one exists. —Rua (mew) 22:26, 26 September 2018 (UTC)
I strongly disagree with importing reconstructed forms that do not come from scientific works. One needs criteria to accept such forms, and an academic paper is a good one. Otherwise, anyone can guess their own form. So if you run your bot, please import only "validated" forms. Pamputt (talk) 14:18, 27 September 2018 (UTC)
I agree with that. Only sourced reconstructed forms should be imported. Unsui (talk) 15:50, 27 September 2018 (UTC)
Wiktionary's goal is to be an alternative for existing dictionaries, including etymological dictionaries, not to be dependent on them. The criterion used by Wiktionary is that reconstructions follow established sound laws. Some reconstructions from linguistic sources don't pass that criterion. It fits with the general policy in Wiktionary of not blindly copying from dictionaries but making sure that forms make sense. Reconstructions that are questionable, whether from an external source or not, can be discussed and deleted if found to be invalid. If you have doubts about any of the reconstructions in Wiktionary, you should discuss it there.
That said, what should be done if words in different languages come from a common source, but there is no source that gives a reconstruction? Can lemmas be empty? —Rua (mew) 15:54, 27 September 2018 (UTC)
Here are some cases where Wiktionary has had to correct errors and omissions in sources. I provide a link to Wiktionary, and a link to Álgu, which gives its source.
...and many more. So you see if we have to rely on sources, we become vulnerable to errors, whereas we can correct those errors on Wiktionary, making it more reliable. If Wikidata can't apply the same level of scientific rigour then that is rather worrying. —Rua (mew) 16:42, 27 September 2018 (UTC)
"Wiktionary's goal is to be an alternative for existing dictionaries, including etymological dictionaries, not to be dependent on them."
This is maybe the case on the English Wiktionary, but on the French Wiktionary original work on etymology is not allowed; all etymological information has to be sourced. Yet Wikidata has to define its own criteria, and about reconstructed forms nothing has been decided so far. About your question "what do we do when a source gives wrong information", I would say that in this case we set a deprecated rank. Pamputt (talk) 19:05, 27 September 2018 (UTC)
You say, for example, "North Sami requires final *ā". OK, but why not *ö? Because linguists have defined laws for this language. It is always linguists' work. Hence, it is possible to give a reference. Otherwise anything may be created as a reconstructed form. Unsui (talk) 07:16, 28 September 2018 (UTC)
That's nonsense. It still has to stand up to scrutiny. —Rua (mew) 10:02, 28 September 2018 (UTC)
  • For how many new ones is this? --- Jura 11:11, 26 September 2018 (UTC)
  •   Oppose for now. It's unclear how many would be imported and we need to solve the original research question first. --- Jura 08:03, 27 September 2018 (UTC)
    Can you elaborate? I don't see what the problem is. —Rua (mew) 10:07, 27 September 2018 (UTC)
    Apparently, you don't know how many you plan to import. --- Jura 10:12, 27 September 2018 (UTC)
    I gave a link to the categories in the other discussion. —Rua (mew) 10:20, 27 September 2018 (UTC)
    • Can you make a reliable statement? Categories tend to evolve and change subcategories. --- Jura 10:22, 27 September 2018 (UTC)
    wikt:Category:Proto-Samic lemmas currently contains 1303 entries. —Rua (mew) 10:25, 27 September 2018 (UTC)
  • I've made a post regarding the import and the conflict in Wiktionary vs Wikidata's policies: wikt:WT:Beer parlour/2018/September#What is Wiktionary's stance on reconstructions missing from sources?. —Rua (mew) 17:36, 27 September 2018 (UTC)
  • Is there any news on this? —Rua (mew) 10:08, 17 October 2018 (UTC)
    @Jura1:, are you fine now with the approval of this bot?--Ymblanter (talk) 13:01, 21 October 2018 (UTC)
    • I will try to write something tomorrow. --- Jura 18:21, 21 October 2018 (UTC)
    • First: sorry for the delay. The question what to do with lexemes reconstructed at Wiktionary remains open. In general, we would only import information from other WMF sites when we know or can assume that it can be referenced to other quality sources. This isn't the case here. One could argue that Wiktionary is an independent dictionary website and should be considered a reference on its own. Whether or not this is the case depends on how Wikidata and the various Wiktionaries will work going forward. The closer Wiktionary and Wikidata would work together going forward the less we can consider it as such. --- Jura 04:14, 25 October 2018 (UTC)
      • The majority of the Proto-Samic entries on Wiktionary does have an Álgu ID (P5903). Proto-Uralic entries mostly have Uralonet ID (P5902), but the lemma is not always identical to the form given on Uralonet, for which User:Tropylium is mostly responsible as the primary Uralic expert on Wiktionary. Would it be acceptable to import only those entries that have one of these IDs?
      • If so, that leaves the question of what to do with the remainder. It would be a shame if these can't be included in Wikidata, and would mean that Wiktionary is always more complete than Wikidata can be. Words that have an etymology on Wiktionary would have none on Wikidata, because of the Proto-Samic ancestral form being missing. —Rua (mew) 18:43, 30 October 2018 (UTC)
    @Rua: yes, importing lexemes that have Álgu ID (P5903) or Uralonet ID (P5902) is fine with me. However, lexemes whose lemma is not identical to the form given on Uralonet should not be imported, because they are not verifiable. They have to be similar to what the source says. Pamputt (talk) 21:58, 30 October 2018 (UTC)
  • Now pinging @Pamputt: as well.--Ymblanter (talk) 20:02, 21 October 2018 (UTC)
    I have not changed my opinion, because this bot wants to import reconstructed forms without any academic references. If the bot uses academic work as a source, it is fine with me; if not, I oppose (and the discussion shows that we are in this case). Pamputt (talk) 20:08, 21 October 2018 (UTC)


zbmathAuthorID
Operator: Zbmath authorid

Task/s: Add the external ID zbmath author ID (P1556) to the Wikidata items of mathematicians, based on manually checked data curation at


Function details: The mathematical bibliographic database (Q18241050) maintains links to several services and databases, among others to Wikidata (e.g. links to Q371957). It currently has 11,000 such links, half of which have been established manually. I would like to register a bot that stores the corresponding back link Wikidata -> zbmath on the Wikidata side for each of these zbmath -> Wikidata links, i.e. adds one P1556 claim. It would run daily with a very low load (approximately 5 edits a day).
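
The back-linking step described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the operator's code: the function name, the input dictionaries, and the sample identifiers are hypothetical, and the actual write to Wikidata is left out.

```python
from typing import Dict, List, Set, Tuple

P_ZBMATH_AUTHOR_ID = "P1556"  # property named in the request


def missing_backlinks(
    zbmath_to_qid: Dict[str, str],
    existing_claims: Dict[str, Set[str]],
    daily_limit: int = 5,
) -> List[Tuple[str, str, str]]:
    """Return (QID, property, value) triples for back links not yet on Wikidata.

    zbmath_to_qid: hypothetical export of the zbmath -> Wikidata links.
    existing_claims: hypothetical snapshot of P1556 values already on each item.
    daily_limit: caps the batch, mirroring the "about 5 a day" load in the request.
    """
    todo: List[Tuple[str, str, str]] = []
    for author_id, qid in sorted(zbmath_to_qid.items()):
        if author_id in existing_claims.get(qid, set()):
            continue  # the claim already exists, nothing to add
        todo.append((qid, P_ZBMATH_AUTHOR_ID, author_id))
        if len(todo) >= daily_limit:
            break
    return todo
```

Computing the batch separately from the write step makes it easy to review a day's planned edits before anything is saved.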

--Zbmath authorid (talk) 16:09, 27 August 2018 (UTC)

Please make some test edits.--Ymblanter (talk) 17:46, 30 August 2018 (UTC)


ScorumMEBot
Operator: ScorumME

Task/s: Create and update football statistics in Wikidata. At the moment these are wins, losses, draws, goals scored, and goals conceded in the league for certain teams. The bot will work in semi-manual mode. I am ready to accept responsibility for all edits made by the bot.

Function details: The bot is written in Node.js and uses the Wikidata Edit and Wikidata SDK libraries by Maxlath. The server listens to a feed that provides real-time statistics on football matches and, after parsing it, sends a request to update the corresponding Wikidata data. It respects all request limits.

The bot's database stores the identifiers of the Wikidata records.

Question: How long should we wait for your decision on our request? -- – The preceding unsigned comment was added by ScorumMEBot (talk • contribs).
Please perform some test edits. Lymantria (talk) 06:03, 1 September 2018 (UTC)

GZWDer (flood) 3

GZWDer (flood)
Operator: GZWDer

Task/s: Creating items for all Unicode characters

Code: Unavailable for now

Function details: Creating items for 137,439 characters (probably excluding those not in Normalization Forms):

  1. Label in all languages (if the character is printable; otherwise only Unicode name of the character in English)
  2. Alias in all languages for U+XXXX and in English for Unicode name of the character
  3. Description in languages with a label of Unicode character (P487)
  4. instance of (P31): Unicode character (Q29654788)
  5. Unicode character (P487)
  6. Unicode hex codepoint (P4213)
  7. Unicode block (P5522)
  8. writing system (P282)
  9. image (P18) (if available)
  10. HTML entity (P4575) (if available)
  11. For characters in Han script also many additional properties; see Wikidata:WikiProject CJKV character

For characters with existing items the existing items will be updated.
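
As a rough illustration of steps 1, 2, 4, 5 and 6 in the list above, the per-character data could be assembled like this. It is a sketch only: the function name and the dictionary layout are assumptions, it uses Python's `unicodedata` rather than whatever the operator has in mind, and it does not cover blocks, writing systems, images or HTML entities.

```python
import unicodedata
from typing import Dict, List


def character_record(char: str) -> Dict[str, object]:
    """Build a hypothetical item payload for a single Unicode character."""
    codepoint = f"{ord(char):04X}"
    name = unicodedata.name(char, "")  # empty string if the character has no name
    aliases: List[str] = [f"U+{codepoint}"]
    if name:
        aliases.append(name)
    # Printable characters are labelled with themselves, others with their name.
    label = char if char.isprintable() else name
    return {
        "label": label,
        "aliases": aliases,
        "P31": "Q29654788",   # instance of: Unicode character
        "P487": char,         # Unicode character
        "P4213": codepoint,   # Unicode hex codepoint
    }
```

For example, `character_record("\u03a9")` yields the label Ω with the aliases `U+03A9` and `GREEK CAPITAL LETTER OMEGA`, while a non-printable character such as U+200B falls back to its Unicode name as the label.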

Question: Do we need only one item for characters with the same normalized forms, e.g. Ω (U+03A9, GREEK CAPITAL LETTER OMEGA) and Ω (U+2126, OHM SIGN)?--GZWDer (talk) 23:08, 23 July 2018 (UTC)

CJKV characters belonging to CJK Compatibility Ideographs (Q2493848) and CJK Compatibility Ideographs Supplement (Q2493862) such as 著 (U+FA5F) (Q55726748), 著 (U+2F99F) (Q55738328) will need to be split from their normalized form, eg. (Q54918611) as each of them have different properties. KevinUp (talk) 14:03, 25 July 2018 (UTC)
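
To see which characters are affected by the question above, one can check whether two code points share a normalized form. The following is a minimal sketch using Python's standard `unicodedata` module; the helper names are made up for illustration, and the choice of NFC is an assumption.

```python
import unicodedata


def normalized(char: str, form: str = "NFC") -> str:
    """Return the normalized form of a single character."""
    return unicodedata.normalize(form, char)


def share_normalized_form(a: str, b: str, form: str = "NFC") -> bool:
    """True if both characters normalize to the same string."""
    return normalized(a, form) == normalized(b, form)


def changes_under_normalization(char: str, form: str = "NFC") -> bool:
    """True if the character is replaced by something else when normalized."""
    return normalized(char, form) != char
```

With these helpers, OHM SIGN (U+2126) and GREEK CAPITAL LETTER OMEGA (U+03A9) are reported as sharing a normalized form, and U+FA5F is reported as changing under normalization, which is the situation in which separate items were requested.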

Request filed per suggestion on Wikidata:Property proposal/Unicode block.--GZWDer (talk) 23:08, 23 July 2018 (UTC)

  Support I have already expressed my wish to import such dataset. Matěj Suchánek (talk) 09:25, 25 July 2018 (UTC)
  Support @GZWDer: Thank you for initiating this task. Also, feel free to add yourself as a participant of Wikidata:WikiProject CJKV character. [15] KevinUp (talk) 14:03, 25 July 2018 (UTC)
  Support Thank you for your contribution. If possible, I hope you will also add other code (P3295) such as JIS X 0213 (Q6108269) and Big5 (Q858372) to the items you create or update. --Okkn (talk) 16:35, 26 July 2018 (UTC)
  •   Oppose the use of the flood account for this. Given the problems with unapproved, defective bot runs under the "GZWDer (flood)" account, I'd rather see this being done with a new account named "bot" as per policy.
    --- Jura 04:50, 31 July 2018 (UTC)
  • Perhaps we could do a test run of this bot with some of the 88,889 items required by Wikidata:WikiProject CJKV character and take note of any potential issues with this bot. @GZWDer: You might want to take note of the account policy required. KevinUp (talk) 10:12, 31 July 2018 (UTC)
  • This account has had a bot flag for over four years. While most bot accounts contain the word "bot", there is nothing in the bot policy that requires it, and a small number of accounts with the bot flag have different names. As I understand it, there is also no technical difference between an account with a flood flag and an account with a bot flag, except for who can assign and remove the flags. - Nikki (talk) 19:14, 1 August 2018 (UTC)
  • The flood account was created and authorized for activities that aren't actually bot activities, whereas this new task is one. Given that defective bot tasks have already been run with the flood account, I don't think any actual bot tasks should be authorized for it. It's sufficient that I already had to clean up tens of thousands of GZWDer's edits.
    --- Jura 19:46, 1 August 2018 (UTC)
I am ready to approve this request, after a (positive) decision is taken at Wikidata:Requests for permissions/Bot/GZWDer (flood) 4. Lymantria (talk) 09:11, 3 September 2018 (UTC)
  • Wouldn't these fit better into Lexeme namespace? --- Jura 10:31, 11 September 2018 (UTC)
    There is no language with all Unicode characters as lexemes. KaMan (talk) 14:31, 11 September 2018 (UTC)
    Not really a problem. Language codes provide for such cases. --- Jura 14:42, 11 September 2018 (UTC)
    I'm not talking about language code but language field of the lexeme where you select q-item of the language. KaMan (talk) 14:46, 11 September 2018 (UTC)
    Which is mapped to a language code. --- Jura 14:48, 11 September 2018 (UTC)
Note: I'm going to be inactive due to real-life issues, so this request is   On hold for now. Comments are still welcome, but I won't be able to answer them until January 2019.--GZWDer (talk) 12:08, 13 September 2018 (UTC)
  • @GZWDer: you could use the new account for this as well. --- Jura 11:16, 29 September 2019 (UTC)

GZWDer (flood) 2

GZWDer (flood)
Operator: GZWDer

Task/s: Create new items and improve existing items from cebwiki and srwiki

Code: Run via various Pywikibot scripts (probably together with other tools)

Function details: The work include several steps:

  1. Create items from w:ceb:Kategoriya:Articles without Wikidata item (plan to do together with step 2)
  2. Import GeoNames ID (P1566) for pages from w:ceb:Kategoriya:GeoNames ID not in Wikidata
  3. Import coordinate location (P625) for pages from w:ceb:Kategoriya:Coordinates not on Wikidata‎
  4. Add country (P17) for cebwiki items
  5. Add instance of (P31) for cebwiki items
  6. (probably) Add located in the administrative territorial entity (P131) for cebwiki items
  7. (probably) Add located in time zone (P421) for cebwiki items
  8. Add descriptions in Chinese and English for cebwiki items (only if step 4 and 5 is completed)

For srwiki, the actions are similar.

--GZWDer (talk) 13:56, 16 July 2018 (UTC)

Note: until phab:T198396 is fixed, this can only be done step by step, not as multiple tasks at a time.--GZWDer (talk) 14:02, 16 July 2018 (UTC)
  Support Thank you for your elaboration! Keeping to my word now. Mahir256 (talk) 13:59, 16 July 2018 (UTC)
@Mahir256: Please unblock the bot account. I'm not going to import more statements from cebwiki (and srwiki) until the discussion is closed, and I have several other (low-speed) uses for the bot account.--GZWDer (talk) 14:01, 16 July 2018 (UTC)
Yes, I did that, as I said I would do. Although @GZWDer: what will differ in your procedure with regard to the srwiki items? A lot of those places might have eswiki article equivalents (with the same INEGI code (Q5796667)); do you plan to link these if they exist? Mahir256 (talk) 14:02, 16 July 2018 (UTC)
The harvest_template script can not check duplicates and duplicates can only be found after data is imported (this may be a bug, though).--GZWDer (talk) 14:04, 16 July 2018 (UTC)
@Pasleim: Would this functionality be easy to add to the tool? It certainly seems desirable, especially with regard to GeoNames IDs. Mahir256 (talk) 14:06, 16 July 2018 (UTC)
See phab:T199698. I do not use Pasleim's harvest template tool because the tool stops automatically when meeting errors (it should retry the edit; if meeting rate limit retry after some time)--GZWDer (talk) 14:10, 16 July 2018 (UTC)
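
The retry behaviour asked for above (retry the edit, and wait before retrying when a rate limit is hit) could look roughly like this. It is a generic sketch, not code from Pywikibot or from the harvest template tool: `RateLimited`, `retry_edit` and the delay values are all hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error raised by an edit function when the rate limit is hit."""


def retry_edit(
    edit: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call edit(); on RateLimited, wait with exponential backoff and try again."""
    for attempt in range(attempts):
        try:
            return edit()
        except RateLimited:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the waiting strategy replaceable, for example in a dry run where no real pause is wanted.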
  Oppose cebwiki is, as far as many users are concerned, the black hole of wikis. These so-called "data" contain too many mistakes. --Liuxinyu970226 (talk) 14:15, 16 July 2018 (UTC)
  •   Oppose Needs to do far more checking as to whether related items already exist, to add the information and sitelink to existing items if possible; and to appropriately relate the new item to existing items if not. If other items already have any matching identifiers (but are eg linked to a different ceb-wiki item), or there is any other reason to think it may be a duplicate, then any new item should be marked instance of (P31) Wikimedia duplicated page (Q17362920) as its only P31, and be linked to the existing item by said to be the same as (P460). Jheald (talk) 14:19, 16 July 2018 (UTC)
    • Duplicates are easier to find after they are imported to Wikidata than on cebwiki.--GZWDer (talk) 14:24, 16 July 2018 (UTC)
@Jheald: It may be worth our time (or worth the time of those who already make corrections on cebwiki) to go to GeoNames and correct things our(them)selves so that in the event Lsjbot returns it doesn't recreate these duplicates. Mahir256 (talk) 14:34, 16 July 2018 (UTC)
@GZWDer: I try bloody hard to avoid creating new items that are duplicates, going to considerable lengths with off-line scripts and augmenting existing data to avoid doing so; and doing my level best to clear up any that have slipped online, as quickly as I can. I don't see why I should expect less from anybody else. Jheald (talk) 14:45, 16 July 2018 (UTC)
  •   Comment Given the capacity problems of Wikidata and the fact that cebwiki is practically dormant, I don't think this should be done. Somehow I doubt the operator will do any of the announced maintenance, as I think they announced that a couple of months back and then left it to other Wikidata users. So no, not another 600,000 items. For the general discussion, see Wikidata:Project_chat#Another_cebwiki_flood?.
    --- Jura 20:18, 16 July 2018 (UTC)
    • cebwiki is not dormant as the articles are still being maintained.--GZWDer (talk) 00:30, 17 July 2018 (UTC)
    • Is there a way to see this on ceb:? I take it that any user on ceb:Special:Recent changes without a local user page isn't really active there.
      --- Jura 04:41, 26 July 2018 (UTC)
  •   Oppose Per Jheald. Planning that it "is much easier to find such duplicates if the data is stored in a structured way", so deliberately importing duplicates (which won't be merged within a very short time) is an abuse of Wikidata and our resources. Resources spent on cleaning the mess of some origin are missing at other places to bring high quality data to other wikis and elsewhere. The duplicates are a big problem, they pop up on search and queries etc. Sitelinks might be added after data is cleaned off-Wikidata (if cleaning is feasible at all; no idea perhaps deletion of articles on cebwiki is a better solution than importing cebwiki sitelinks here). --Marsupium (talk) 23:26, 18 July 2018 (UTC)
    • Duplicates already exist everywhere in Wikidata, so it cannot be guaranteed that different items refer to different concepts (though that is usually the case), and nobody should use search and query results directly without care. Searches are not intended to be used directly by third-party users. For queries, if data consumers really think duplicates in Wikidata query results are an issue, they can choose to exclude cebwiki-only items from the query results.--GZWDer (talk) 23:45, 18 July 2018 (UTC)
  •   Oppose Thanks a lot for your work on other wikis, it is immensely useful, but this workflow is really not appropriate for cebwiki. Creating new cebwiki items without being certain that they do not duplicate existing items creates a significant strain on the community. It is not okay to expect people to find ways to exclude cebwiki-only items in query results as a result: these items should not be created in the first place. − Pintoch (talk) 09:55, 19 July 2018 (UTC)
    • probably 90% of entries are unique to cebwiki. It may be wise to import these unique entries first.--GZWDer (talk) 16:38, 20 July 2018 (UTC)
      • Well, whatever the actual percentage is, many of us have painfully experienced that it is way too low for our standards. It may be wise to be more considerate to your fellow contributors, and stop hammering the server too. A lot of people have complained about cebwiki item creations, and it is really a shame that a block was necessary to actually get you to stop. So I really stand by my oppose. − Pintoch (talk) 07:34, 21 July 2018 (UTC)
    • The approach outlined above doesn't really address any of the problems with the data.
      --- Jura 04:41, 26 July 2018 (UTC)
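
Jheald's suggestion earlier in this thread (mark a suspected duplicate as a Wikimedia duplicated page and link it to the existing item) could be sketched as below. This is only an illustration: the function name and the lookup table are hypothetical, and matching on GeoNames ID alone is an assumption that other participants in this discussion consider unreliable.

```python
from typing import Dict


def statements_for_page(geonames_id: str, items_by_geonames: Dict[str, str]) -> Dict[str, str]:
    """Plan statements for a new item created from a cebwiki page.

    items_by_geonames: hypothetical index of existing items, keyed by GeoNames ID (P1566).
    """
    statements = {"P1566": geonames_id}
    existing_qid = items_by_geonames.get(geonames_id)
    if existing_qid is not None:
        # Another item already carries this identifier: flag the new item
        # instead of presenting it as an independent concept.
        statements["P31"] = "Q17362920"   # Wikimedia duplicated page
        statements["P460"] = existing_qid  # said to be the same as
    return statements
```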

Plan 2

The plan only does:

  1. Create items from w:ceb:Kategoriya:Articles without Wikidata item (plan to do together with step 2)
  2. Import GeoNames ID (P1566) for pages


  1. It is easier to find articles that exist in other Wikipedias by search and projectmerge (and possibly mix'n'match and other tools)
  2. It is also possible to find entries by GeoNames ID, and vice versa
  3. As no other data will be imported in plan 2, it will not pollute query results and OpenRefine (unless one specifically queries GeoNames ID)
  4. Others may still import other data to these items, but only if they're confident to do so; they had better import coordinates etc. from a more reliable database (e.g. GEOnet Names Server)

--GZWDer (talk) 06:09, 26 July 2018 (UTC)

  Oppose I only oppose your *cebwiki* importing; you are free to import Special:unconnectedpages links from wikis other than this one. --Liuxinyu970226 (talk) 04:45, 31 July 2018 (UTC)
  • @Pasleim: seems to have done quite a lot of maintenance on cebwiki sitelinks. I'm curious what his view is on this.
    --- Jura 06:39, 31 July 2018 (UTC)
  Oppose, this still pollutes OpenRefine results - especially when reconciling via GeoNames ID, which should be the preferred way when this id is available in the table. I don't see how voluntarily keeping the items basically blank would be a solution at all, it makes it harder to find duplicates. − Pintoch (talk) 11:54, 5 August 2018 (UTC)
Do you have experience with matching based on existing GeoNames IDs then? I still see items on a regular basis which have the wrong ID thanks to bots which imported lots of bad matches years ago (e.g. Weschnitz (Q525148) and River Coquet (Q7337301)), so it would be great if you could explain what you did to avoid mismatches so that bots can do the same. If bots assume that our GeoNames IDs are correct, they'll add sitelinks/statements/descriptions/etc to the wrong items and make a mess that's much harder to clean up than duplicates are. - Nikki (talk) 20:09, 5 August 2018 (UTC)
@Pintoch: Wikidata QIDs are designated as persistent identifiers; they remain valid when items are merged, but no guarantee should be assumed that any item (whether bot-created or not) is never merged or redirected. There are plenty of mismatches between cebwiki and Wikidata (which should be solved), but creating new items will not bring any new mismatches. Also, why do you think that leaving cebwiki pages unconnected makes it easier to find duplicates?--GZWDer (talk) 09:28, 6 August 2018 (UTC)
@Nikki: Yes I have experience with matching based on GeoNames IDs, and it generally gives very bad results because many items get matched to cebwiki items instead of the canonical item. I don't have any good strategy to avoid mismatches and that is the reason why I regret that these cebwiki items have been created without the appropriate checks for existing duplicates. I understand that cebwiki imports are not the only imports responsible for the unreliability of GeoNames ids in Wikidata, but in my experience the majority of errors came from cebwiki. I am not sure I fully get your point: are you arguing that it is fine to create duplicate cebwiki items because GeoNames IDs in Wikidata are already unreliable? I don't see how existing errors are an excuse for creating more of them. − Pintoch (talk) 09:02, 12 August 2018 (UTC)
@Pintoch: I am arguing that we need to avoid linking the cebwiki pages to the wrong items because merges are vastly better than splits, and that will involve some duplicates. Duplicate IDs continue being valid and will point to the right item even after a merge. The same is not true of splitting and you never know who is already using the ID. I agree that it would be nice to reduce the number of duplicates it creates, but nobody seems to have any idea how it should do that without creating even more bad matches, which is why I was hoping you might have some tips. - Nikki (talk) 13:12, 12 August 2018 (UTC)
@Nikki: okay, I get your point, thanks. So, no I haven't looked into the problem myself. If I had time I would first try to clean up the current items rather than creating new ones (and you have worked on this: thanks again!). I don't think there is any rush to empty w:ceb:Kategoriya:Articles without Wikidata item, so that's why I oppose this bot request. − Pintoch (talk) 18:24, 12 August 2018 (UTC)
@GZWDer: creating new items will not bring any new mismatches: creating new items will create new duplicates, and that is what disrupts our workflows. I personally don't care about the Wikidata <-> cebwiki mapping. If you care about this mapping, then please improve it without creating duplicates (that is, with reliable heuristics to match the cebwiki articles to existing items). If you do not have the tools to do this import without being disruptive to other Wikidata users, then don't do it. If someone else files a bot request to do this task, with convincing evidence that their import process is more reliable than yours, I will happily support it. − Pintoch (talk) 09:02, 12 August 2018 (UTC)
@Pintoch: Your argument is basically "creating new duplicates is harmful in any case", but duplicates already exist everywhere, created by different users. They may eventually be merged, and their IDs remain valid. There are many more cases where no match is found, and no items will be created for them in the foreseeable future (as it is not possible to handle all 500,000 pages manually).
@GZWDer: there are three differences between other users' duplicates and your duplicates: the first is the scale (500,000 items for this proposal), the second is the absence of any satisfactory checks for existing duplicates (which is unacceptable), the third is the domain (geographical locations are pivotal items that many other domains rely on - creating a mess there is more disruptive than in other areas). This is about creating 500,000 new geographical items with no reconciliation heuristics to check for existing duplicates. This is really detrimental to the project, and I am not the only one complaining about it. − Pintoch (talk) 10:31, 19 August 2018 (UTC)
Also, what about first creating items for pages without existing items with the same labels (this is the default setting of PetScan)?--GZWDer (talk) 20:12, 13 August 2018 (UTC)
I think checks need to be more thorough than that, for instance because cebwiki article titles often include disambiguation information in brackets. For instance, these heuristics would fail to identify,_Montana) and Amsterdam-Churchill (Q614935). − Pintoch (talk) 10:31, 19 August 2018 (UTC)
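
One way to make a label check less naive about bracketed disambiguation is to strip a trailing parenthetical before comparing. The sketch below shows only that idea; the function names, the label index and the sample title are hypothetical, and the result is a list of candidates to review, not a confirmed match.

```python
import re
from typing import Dict, List

# Matches one trailing "(...)" group, including the whitespace around it.
_TRAILING_BRACKETS = re.compile(r"\s*\([^()]*\)\s*$")


def base_title(title: str) -> str:
    """Drop underscores and a trailing bracketed disambiguation from a page title."""
    return _TRAILING_BRACKETS.sub("", title.replace("_", " ")).strip()


def candidate_items(title: str, qids_by_label: Dict[str, List[str]]) -> List[str]:
    """Look up existing items whose (case-folded) label equals the stripped title."""
    return qids_by_label.get(base_title(title).casefold(), [])
```

A page titled with a bracketed suffix would then still surface an existing item with the bare label, so that a human or a stricter check (coordinates, country, identifiers) can decide whether it is the same place.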
  • Oppose. Although I'm not aware of this being a policy so far, I believe new items should be created from the encyclopedia that is likely to have the best information on them. A bot shouldn't create new items from a Russian Wikipedia item about a US state or a US politician, and a bot shouldn't create new items about Russian city or politician from an English Wikipedia article. This restriction wouldn't necessarily apply to items that are not firmly connected to any particular, country, such as algebra for example. Jc3s5h (talk) 16:18, 30 August 2018 (UTC)
    • No, this isn't a policy and it never could be. One of Wikidata's main functions is to support other Wikimedia projects by providing interwiki links and structured data. Requiring links to a particular Wikipedia before an item is considered notable would cripple Wikidata. We also can't control which Wikipedias people copy data from. We can refuse to allow a bot to run but that doesn't stop people from doing it manually or with tools like Petscan and Harvest Templates. - Nikki (talk) 12:08, 31 August 2018 (UTC)
  • @Ivan_A._Krestinin: In the meantime, KrBot seems to be doing this. --- Jura 10:28, 11 September 2018 (UTC)
  • Have no time to read the discussion. My bot is importing country (P17), coordinate location (P625), GeoNames ID (P1566) from cebwiki now. — Ivan A. Krestinin (talk) 21:24, 11 September 2018 (UTC)
    • @Ivan_A._Krestinin: There is a lot of opposition to mass-creating new items for cebwiki items (see above), so you should create a new request for permissions before continuing. - Nikki (talk) 12:05, 12 September 2018 (UTC)
      • Ok, I disabled new item creation. I have code for connecting pages from different wikies. But it does not work without item creation because it is based on scheme: import data, find duplicate items, analyze data conflicts, labels and etc., merge items. — Ivan A. Krestinin (talk) 20:07, 12 September 2018 (UTC)
        • Thanks. The main issue is that people don't want duplicates. If you can explain what your bot does to avoid duplicates when you create a new request for permissions, it will hopefully be enough to change people's minds. :) - Nikki (talk) 09:00, 13 September 2018 (UTC)

If someone is creating items for all cebwiki articles, I still plan to add statements and descriptions to them. However, due to real-life issues I'd like to place the request   On hold until January-February 2019 and see what happens. Comments and questions are still welcome, but I will probably not be able to answer them anytime soon.--GZWDer (talk) 06:10, 12 September 2018 (UTC)

@GZWDer: Since there are too many oppose comments, and privacy concerns have already been raised at WMF Trust & Safety, it's unlikely that your work can be approved, so why not withdraw it? --Liuxinyu970226 (talk) 22:41, 15 September 2018 (UTC)

crossref bot

crossref bot
Operator: Mahdimoqri

Task/s: to add missing journals from crossref api


Function details: add missing journals from crossref --Mahdimoqri (talk) 21:12, 19 April 2018 (UTC)

See the discussion here and the data import request and workflow here

@DarTar, Daniel_Mietchen, Fnielsen, John_Cummings, Mahir256: any thoughts or feedback?


WikiBot
Operator: 1succes2012

Task/s: access and parse data from Wikipedia

Code: To be developed

Function details: get article summaries, get data like links and images from a page and return it to my users --1succes2012 (talk) 15:09, 17 June 2018 (UTC)

  Comment For accessing data, a bot account is not necessary (unless you are about to hit security limits). Matěj Suchánek (talk) 09:09, 3 August 2018 (UTC)


PricezaBot
Operator: Pricezabot

Task/s: Add price to wikidata commercial products (e.g. phone, electronics, camera, etc)


Function details: --Pricezabot (talk) 09:18, 14 June 2018 (UTC) Priceza is a price comparison engine in SEA. We have a lot of pricing data for commercial products, and this bot will create statements in Wikidata with pricing details from our website.

Comment If you're going to be importing data from your own aggregate website, this would quite literally be a spambot... Chrissymad (talk) 19:29, 14 June 2018 (UTC)


schieboutct
Operator: Ctschiebout

Task/s: create bot to add missing demonyms


Function details: --Ctschiebout (talk) 01:39, 22 April 2018 (UTC)

  Comment Code? Source? Matěj Suchánek (talk) 09:09, 3 August 2018 (UTC)

wikidata get

wikidata get
Operator:



Function details: -- 10:51, 15 June 2018 (UTC)

  Comment Please expand this request. Matěj Suchánek (talk) 09:15, 3 August 2018 (UTC)

Wolfgang8741 bot

Wolfgang8741 bot
Operator: Wolfgang8741

Task/s: OpenRefine imports to Wikidata.

Code: N/A

Function details: Data imports from OpenRefine datasets --Wolfgang8741 (talk) 02:16, 18 June 2018 (UTC)

  Comment What kind of data from what source?  – The preceding unsigned comment was added by Matěj Suchánek (talk • contribs).
@Matěj Suchánek: Sorry I missed this comment. This is not a fully automated bot but a human-assisted tool, OpenRefine, for larger imports, starting with small-scale tests before larger application. The current focus is on the GNIS import. Yes, the import description and process need to be built out a bit more; I'm not using this until I refine the process and get community approval to import. Initial learning curve and orientation to the Wikidata processes in progress. Wolfgang8741 (talk) 15:49, 5 September 2018 (UTC)

CanaryBot 2

CanaryBot
Operator: Ivanhercaz

Task/s: set labels, descriptions and aliases in Spanish to properties without them in Spanish.

Code: the code is available in PAWS, it is a Jupyter IPython notebook. But when I have time I will upload the ipynb and py file to CanaryBot repo in GitHub.

Function details: This task is very similar to the first task I requested. I asked Ymblanter for an opinion about it, and Ymblanter recommended that I request a new task because I am going to use new code: in this case I am going to set labels, descriptions and aliases on properties, not on items as I did in my last task.

In addition, this script works differently: I extracted all the properties lacking a label, a description, or both in Spanish, and merged them into one CSV in which I am filling the cells with their respective translations. When all the cells are filled I will run the script, which will read each row, check whether the property has labels, descriptions and aliases in Spanish and, if not, add the content of the respective cells.
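
The row-by-row check described above might look something like the following. It is a sketch under assumptions rather than the notebook's code: the column names (`property`, `label`, `description`), the function name and the shape of the `existing` snapshot are invented for illustration, and aliases are left out.

```python
import csv
import io
from typing import Dict, List, Tuple


def plan_spanish_terms(
    csv_text: str,
    existing: Dict[str, Dict[str, str]],
) -> List[Tuple[str, Dict[str, str]]]:
    """Return the Spanish terms to add, skipping anything the property already has.

    existing: hypothetical snapshot, e.g. {"P1": {"label": "..."}}.
    """
    edits: List[Tuple[str, Dict[str, str]]] = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        pid = row["property"]
        current = existing.get(pid, {})
        change: Dict[str, str] = {}
        for field in ("label", "description"):
            value = (row.get(field) or "").strip()
            if value and not current.get(field):
                change[field] = value  # only fill in what is missing
        if change:
            edits.append((pid, change))
    return edits
```

Separating the planning from the saving also gives a natural place to write the log mentioned in the request.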

I have to test and improve some parts of the script. It is very basic, but it works for what I want to do. I log everything so that I know how to fix an error if one occurs.

Well, I await your answers and opinions. Thanks in advance!

Regards, Ivanhercaz   (Talk) 23:45, 10 May 2018 (UTC)

I improved the code: some stats, log report fixed... I think it is ready to run without problems. What I need now is to finish the translations of the properties. I await your opinions about this task. Regards, Ivanhercaz   (Talk) 15:54, 12 May 2018 (UTC)
I am ready to approve the bot task in a couple of days, provided that no objections will be raised. Lymantria (talk) 06:54, 13 May 2018 (UTC)
  • Could you link the test edits? I only find Portuguese.
    --- Jura 16:24, 13 May 2018 (UTC)
    Of course, Jura. I thought I had shared the contributions on test.wikidata, but I had not; excuse me. I think you are referring to the edits in Portuguese that my bot made for its first task. In this case I only work with Spanish labels, descriptions and aliases. You can check my last contributions on test.wikidata. I noticed that the edit summary was wrong because it was "setting es-label" for everything, that is, whenever the bot changed a description, an alias or a label; I just fixed it and now it shows the correct summary, as you can see in the last three edits. But I have found a bug that I have to fix: if you check this diff, you can see that the bot replaced the existing alias with the new one, whereas what I want is to append the new aliases and keep the old ones, so I have to fix it.
    I am not worried about the time or whether the task is accepted now or in the future; I just wanted to propose it and talk about how it would work. But, to be honest, I still have to fill in the CSV file, so I have plenty of time to fix this type of error and improve the script. That is why I requested another task.
    Regards, Ivanhercaz   (Talk) 17:19, 13 May 2018 (UTC)
    For bot approvals, operators generally do about 50 or more edits here at Wikidata. These "test edits" are expected to be of standard quality.
    --- Jura 17:23, 13 May 2018 (UTC)
    I know, Jura, but I can't make the test edits in Wikidata without authorization or someone's request, because this task is not approved. Well, since you are requesting these test edits, once the aliases bug has been solved I will run the bot in Wikidata and report here whether it works fine or not. Regards, Ivanhercaz   (Talk) 17:29, 13 May 2018 (UTC)
    I fixed the bug of the aliases, as you can check here. I will notify you, Jura, when I have done the test edits in Wikidata and not in test.wikidata. Regards, Ivanhercaz   (Talk) 18:26, 13 May 2018 (UTC)
  • @Jura1, Ymblanter: Today I could only make a few test edits. I will make more in the next days to check it better. Regards, Ivanhercaz   (Talk) 18:15, 14 May 2018 (UTC)
    I forgot to share with you the log and if you check the notebook you can see the generated graph. Regards, Ivanhercaz   (Talk) 18:26, 14 May 2018 (UTC)

maria research bot

maria research bot (talk • contribs • new items • SUL • Block log • User rights log • User rights • xtools)
Operator: Mahdimoqri (talk • contribs • logs)

Task/s: add missing articles and citations information for articles listed on PubMed Central


Function details: --Mahdimoqri (talk) 06:15, 13 March 2018 (UTC)

  Support Mahir256 (talk) 22:37, 13 March 2018 (UTC)
  Comment This Fatameh-based script is useful for most of phase 1 and works fine for PubMed IDs and for some Crossref IDs as well but it does not address the citation part from phase 2 onwards. --Daniel Mietchen (talk) 13:27, 14 March 2018 (UTC)
Thanks Daniel Mietchen, I modified the description of the task here to confirm what the bot does at the moment. Mahdimoqri (talk) 15:52, 14 March 2018 (UTC)
  Support That looks good to me. --Daniel Mietchen (talk) 19:54, 14 March 2018 (UTC)
  Support The Fatameh edits from this bot seem fine so far. It is a nice simple script. I note some Fatameh artifacts in the titles, e.g., "*." in BOOKWORMS AND BOOK COLLECTING (Q50454030). But I suppose we have to live with that... — Finn Årup Nielsen (fnielsen) (talk) 18:44, 14 March 2018 (UTC)
I was going to write the same thing. Can we remove the trailing full stop (".") ? I'm sure some bot could clean up the existing ones as well.
--- Jura 20:37, 14 March 2018 (UTC)
Thanks Finn Årup Nielsen (fnielsen) and Jura, I would be happy to add another script to remove asterisks or to fix any other issues you find, after the PMC items are added. Mahdimoqri (talk) 23:10, 14 March 2018 (UTC)
For the final dot, can you remove this before adding it to label/title statement?
--- Jura 23:17, 14 March 2018 (UTC)
Thanks Jura! Unfortunately, as far as I know, Fatameh does not have any out-of-the-box option for such changes. I'd recommend writing a separate script just for this purpose, since 14 million other articles currently have the same problem. Daniel Mietchen might be interested in such a script too. Mahdimoqri (talk) 02:51, 15 March 2018 (UTC)
@T Arrow, Tobias1984: could you fix Fatameh?
--- Jura 07:21, 15 March 2018 (UTC)
There is a task for it here: Mahdimoqri (talk) 15:08, 15 March 2018 (UTC)
Do any of the people who wrote the code actually follow phabricator? I tried to find the part of the code where the dot gets added/should be removed, but I was probably in the wrong module. Any ideas?
--- Jura 05:16, 16 March 2018 (UTC)
I'm just not checking it all that regularly. I've replied to the ticket. Fatameh relies on WikidataIntegrator to do most of the heavy lifting. This uses PubMed as the data source and (unfortunately?) they actually report all the titles as ending in a period (or other punctuation). I think we need to find a reference for the titles without the period rather than just changing all the existing statements. There was a short discussion on the WikiCite Mailing List as well. I'm happy to work on a solution but I'm not really sure what the best way forward is. T Arrow (talk) 09:26, 16 March 2018 (UTC)
Jura, I added the fix for the trailing dots and asterisks in a separate script (fatameh_sister_bot). Are there any other issues that I can address to have your support? Mahdimoqri (talk) 06:22, 17 March 2018 (UTC)
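For readers following the title cleanup discussed above, a minimal sketch of what such a fix could look like is given here. It is an illustration only, not the code of fatameh_sister_bot or Fatameh; the function names and the choice to leave an ellipsis untouched are assumptions.

```python
def clean_title(title):
    """Strip one trailing full stop and any trailing asterisks from a title."""
    cleaned = title.strip()
    # Assumption: an ellipsis is part of the title, not a PubMed artifact.
    if cleaned.endswith("..."):
        return cleaned
    if cleaned.endswith("."):
        cleaned = cleaned[:-1].rstrip()
    return cleaned.rstrip("*").rstrip()


def looks_translated(title):
    """True if the cleaned title is wrapped in square brackets, which marks a translated title."""
    cleaned = clean_title(title)
    return cleaned.startswith("[") and cleaned.endswith("]")


print(clean_title("BOOKWORMS AND BOOK COLLECTING*."))
print(looks_translated("[Sexually-transmitted infection in a high-risk group from Montería, Colombia]."))
```

Bracketed titles are only detected, not altered, since the thread notes that simply removing the brackets is not the solution; a bot could use the check to skip such items.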

Thanks all for providing feedback and offering solutions/help to address the issue with Fatameh. It seems the fix will be either in Fatameh or in a separate script. In either case, it is to be applied to all article items, which I believe could be done independently of this bot. Meanwhile, could you support and accept this bot so I can get it started and maybe set up a new bot for fixing other issues? Mahdimoqri (talk) 21:12, 16 March 2018 (UTC)

  Oppose I don't think we should approve another Fatameh based bot until major concerns are fixed. --Succu (talk) 21:24, 16 March 2018 (UTC)
Thanks for your feedback Succu. I just created a bot (Fatameh_sister_bot) that fixes the issue with the label for the items created using Fatameh. I'll make sure I run it on everything maria research bot creates to address the concern with the titles. Are there any other issues that I can address? Mahdimoqri (talk) 06:04, 17 March 2018 (UTC)
@Succu: I also fixed this issue from the root in Fatameh source code here so new items are created without the trailing dot.
Title statements would need the same fix and some labels have already been duplicated into other languages (maybe this is taken care of, but I haven't seen any in the samples).
--- Jura 09:35, 18 March 2018 (UTC)
Thanks for the feedback Jura. The translated labels (if any) are added to labels. I will take care of the title statement now.
@Jura1: the titles are also fixed and the code has been updated. Are there any other issues that I can address to have your support for the bot?
I think the cleanup bot/task can be authorized.
--- Jura 12:30, 21 March 2018 (UTC)
@Jura1: wonderful! this is the request for the cleanup bot: fatameh_sister_bot. Could you please state your support there, for a bot flag?
I don't think edits like this one are OK, Mahdimoqri, because you are ignoring the reference given. And please wait with these kinds of corrections until you have the flag. --Succu (talk) 22:36, 22 March 2018 (UTC)
@Succu: the title in the reference is not exactly correct. Please refer to this reference or this reference for the correct title. Would you like the bot to change the reference as well? – The preceding comment was unsigned.
The cleanup should be fine. It just strips an artifact PMD adds.
--- Jura 09:24, 23 March 2018 (UTC)
Translated titles are enclosed within brackets. This should be changed. The current version overwrites existing page(s) (P304) with incomplete values. --Succu (talk) 10:08, 18 March 2018 (UTC)
@Succu: thanks for the feedback! I could not find any instance of either of the issues! Could you please reply with one instance of each of these two issues that is created by my bot so that I can address them? Mahdimoqri (talk) 03:17, 19 March 2018 (UTC)
[Sexually-transmitted infection in a high-risk group from Montería, Colombia]. (Q50804547) is an example for the first issue. Removing the brackets only is not the solution. --Succu (talk) 22:30, 22 March 2018 (UTC)
I will not import any items with translated titles (until there is a consensus on what is the solution on this). Mahdimoqri (talk) 14:06, 30 March 2018 (UTC)
We should try to figure out how to handle them (e.g. import the original language and delete the "title" statement, possibly find the original title and add that as title and label in that language). For new imports, it would just need to skip adding the title statement and add a language of work or name (P407).
--- Jura 09:24, 23 March 2018 (UTC)
Or use original language of film or TV show (P364). Anyway, it should be made clear that the original title is not English. -- JakobVoss (talk) 14:45, 24 March 2018 (UTC)
Should we attempt to add a statement that identifies them as not being in English before we actually manage to determine the original language?
--- Jura 21:12, 24 March 2018 (UTC)


AmpersandBot (talk • contribs • new items • SUL • Block log • User rights log • User rights • xtools)
Operator: PinkAmpersand (talk • contribs • logs)

Task/s: Generate descriptions for village items in the format of "village in <place>, <place>, <country>"


Function details: With my first approved task (approved in July 2016, but not completed until recently), I set descriptions for about 20,000 Ukrainian villages based on their country (P17), instance of (P31), and located in the administrative territorial entity (P131) values. Now, I would like to use the latter two values to generalize this script to—ominous music—every village in the world!

The script works as follows:

  1. It pulls up 5,000 items backlinking to village (Q532)
  2. It checks whether an item is instance of (P31) village (Q532)
  3. It then labels items as follows:
    1. It removes disambiguation from labels in any language:
      1. It runs a RegEx search for ,| \(
      2. It removes those characters and any following ones
      3. It sets the old label as an alias for the given language
      4. If the alias is in Unicode, it creates an ASCII version and sets that as an alias as well
      5. It compiles a new list of labels and aliases for the relevant languages, and updates the item with all of them at once
    2. It sets labels in all Latin-script languages:
      1. It checks if the current Latin-script languages all use the same label.
      2. If they don't, it does nothing except log the item for further review.
      3. If they do, it sets that label as the label for all other Latin-script languages, using a list of 196 (viewable in the source code)
      4. If the label is in Unicode, it also sets an ASCII version of the label as an alias
      5. It compiles a new list of labels and aliases for the relevant languages, and updates the item with all of them at once
  4. And describes items as follows:
    1. It checks whether the item either a) lacks an English description or b) has an English description that merely says "village in <country>" or "village in <region>". (I've manually coded into the RegEx the names of every multi-word country. This still leaves a blind spot for multi-word entities other than countries. I welcome advice on how to fix this.)
    2. If so, it gets the item's parent entity. If that entity is a country, it describes the item as "village in <parent>"
    3. If the parent entity is not a country, it checks the grandparent entity. If that is a country, it describes the item as "village in <parent>, <grandparent>"
    4. Next onto the great-grandparent entity. "village in <parent>, <grandparent>, <great-grandparent>"
    5. For the great-great-grandparent entity, only the top three levels are used: "village in <grandparent>, <great-grandparent>, <great-great-grandparent>". This is slightly more likely to result in dupe errors, but the code handles those.
    6. Ditto the thrice-great-grandparent entity.
    7. If even the thrice-great-grandparent is not a country, the item is logged for further review. If people think I should go deeper, I am willing to; I may do so of my own initiative if the test run turns up too many of these errors.
  5. After 5,000 items have been processed, another 5,000 are pulled. The script continues until there are no backlinks left to describe.
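The label and description steps above can be sketched roughly as follows. This is not AmpersandBot's source code: the function names are hypothetical, the ancestor chain is passed in as plain `(name, is_country)` tuples instead of being read from P131 and P17, and the place names are invented. Only the `,| \(` pattern and the "nearest three levels, up to the thrice-great-grandparent" rule come from the description.

```python
import re
import unicodedata

DISAMBIGUATION = re.compile(r",| \(")


def strip_disambiguation(label):
    """Cut the label at the first comma or ' (' and drop everything after it."""
    match = DISAMBIGUATION.search(label)
    return label[:match.start()] if match else label


def ascii_alias(label):
    """Return an ASCII-only variant of the label, or None if it is already ASCII."""
    decomposed = unicodedata.normalize("NFKD", label)
    ascii_form = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_form if ascii_form != label else None


def build_description(ancestors, max_depth=5):
    """Build 'village in ...' from the ancestor chain, nearest entity first.

    Returns None when no country is found within max_depth levels,
    in which case the item would be logged for review.
    """
    for index, (name, is_country) in enumerate(ancestors[:max_depth]):
        if is_country:
            # Keep at most the three highest levels, ending at the country.
            names = [n for n, _ in ancestors[max(0, index - 2): index + 1]]
            return "village in " + ", ".join(names)
    return None


print(strip_disambiguation("Springfield (Example County)"))
print(ascii_alias("Montería"))
print(build_description([("District A", False), ("Region B", False), ("Country C", True)]))
```

One limitation of this ASCII approach: letters without a Unicode decomposition (for example "ł" or "ø") are dropped rather than transliterated, so such aliases would need separate handling.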

Does this sound good? — PinkAmpers&(Je vous invite à me parler) 01:43, 22 February 2018 (UTC) Updated 22:17, 3 March 2018 (UTC)

Test run here. The only issue that arose was some items, like Koro-ni-O (Q25694), being listed in my command line as updated, but not actually updating. It's a bug, and I'll look into it, but its only effect is to limit the bot's potential, not to introduce any unwanted behavior. — PinkAmpers&(Je vous invite à me parler) 02:16, 22 February 2018 (UTC)
I will approve the bot in a couple of days provided no objections have been raised.--Ymblanter (talk) 08:39, 25 February 2018 (UTC)
Cool, thanks! But actually, I'm working on a few more things for the bot to do to these village items while it's "in the neighborhood", so would you mind holding off until I can post a second test run? — PinkAmpers&(Je vous invite à me parler) 00:23, 26 February 2018 (UTC)
This is fine, no problem.--Ymblanter (talk) 10:42, 26 February 2018 (UTC)
@Ymblanter:. Okay. I'm all done. I've updated the bot's description above. Diff of changes here. New test run here. There was one glitch in this test run, namely that the bot failed to add ASCII aliases for Unicode labels while performing the Latin-script label unanimity function. This was due to a stray space before the word aliases in line 247. I fixed that here, and ran a test edit here to check that that worked. But I'm happy to run a few dozen more test edits if you want to see that fix working in action. — PinkAmpers&(Je vous invite à me parler) 22:17, 3 March 2018 (UTC)
Concerning the Latin-script languages, not all of them use the same spelling. For example, here I am sure that in lv it is not Utvin (most likely Utvins), in lt it is not Utvin, and possibly in some other languages it is not Utvin (for example, crh uses phonetic spelling; Utvin may be fine, but other names will not be). I would suggest restricting this part of the task to major languages (say German, French, Spanish, Portuguese, Italian, Danish, Swedish, maybe a couple more) and doing some research for the others (I have no idea, for example, what Navajo uses). The rest seems to be fine.--Ymblanter (talk) 07:48, 4 March 2018 (UTC)
I'm concerned about exonyms too. Even if a language uses the same name variant as other Latin-script languages for most settlements, then there are particular settlements for which it may not do so. 14:30, 4 March 2018 (UTC)
I considered that, but IMHO it's not a problem. The script will never overwrite an earlier label, and indeed won't change the labels unless all existing Latin-script labels are in agreement. So the worst-case scenario here is that an item would go from having no label in one language to having one that is imperfect but not incorrect. An endonym will always be a valid alias, after all. — PinkAmpers&(Je vous invite à me parler) 21:37, 4 March 2018 (UTC)
I'm not sure that all languages consider an endonym as a valid alias if there's an exonym too. And if it is considered technically not incorrect then for some cases an endonym would still be rather odd. My concern on this is similar to one currently brought up in project chat. 07:58, 5 March 2018 (UTC)
I would think that an endonym is by definition a valid alias. The bar for "valid alias" is pretty low, after all. So if there isn't consensus to use endonyms as labels, I can set them as aliases instead. — PinkAmpers&(Je vous invite à me parler) 17:51, 5 March 2018 (UTC)
Also, all romanized names are probably problematic. Many languages may use the same romanization system (the same as in English or the one recommended by the UN) for particular foreign language, but there are also languages which have their own romanization system. So a couple of the current Latin-script languages using the same romanization would be merely a coincidence. 14:49, 4 March 2018 (UTC)
I'm confused about your concern here. The only romanization that the script does is in setting aliases, not labels. — PinkAmpers&(Je vous invite à me parler) 21:37, 4 March 2018 (UTC)
All Ukrainian, Georgian, Arabic etc. place names apart from exonyms are romanized in Latin-script languages. And there are different romanization systems, some of which are specific to a particular language, e.g. Ukrainian-Estonian transcription. For instance, currently all four Latin labels for Burhunka (Q4099444) happen to be "Burhunka", but that wouldn't be correct in Estonian. 07:58, 5 March 2018 (UTC)
Well that's part of why I'm using a smaller set of languages now. Can you give me examples of languages within the set that have this same problem? — PinkAmpers&(Je vous invite à me parler) 17:51, 5 March 2018 (UTC)
Thanks for the feedback, Ymblanter. I've pared back the list, and posted at project chat asking for help with re-expanding it. See Wikidata:Project chat § Help needed with l10n for bot. — PinkAmpers&(Je vous invite à me parler) 21:37, 4 March 2018 (UTC)

I note that here the bot picks up the name of a former territorial entity, though preferred rank is set for the current parish. Also, is the whole territorial hierarchy really necessary in the description if there's no need to disambiguate from other villages with the same name in the same country? For a small country like Estonia I'd prefer simpler descriptions. 14:30, 4 March 2018 (UTC)

The format I'm using is standard for English-language labels. See Help:Description § Go from more specific to less specific. — PinkAmpers&(Je vous invite à me parler) 21:37, 4 March 2018 (UTC)
The section you refer to concerns the order in which a description goes from more specific to less specific. As for how specific you should go, it leaves that open, apart from saying that adding one subregion of a country is common and giving two examples where the whole administrative hierarchy is not shown. 07:58, 5 March 2018 (UTC)
To me, the takeaway from Help:Description is that using a second-level subregion is not required, but also not discouraged. It comes down to an individual editor's choice. — PinkAmpers&(Je vous invite à me parler) 17:51, 5 March 2018 (UTC)
  •   Comment I'm somewhat concerned about the absence of a plan to maintain this going forward. If descriptions in 200 languages for 100,000s of items are being added, this becomes virtually impossible to correct manually. Descriptions may need to be maintained if the name changes, or if the P131 is found to be incorrect or irrelevant. Already now, default labels for items that may seem static (e.g. categories/lists) aren't maintained once they are added; this would just add another chunk of redundant data that isn't maintained. The field already suffers from the absence of maintenance of cebwiki imports, so please don't add more to it. Maybe one would want to focus on English descriptions and native label statements instead.
    --- Jura 10:16, 12 March 2018 (UTC)


Arasaacbot (talk • contribs • new items • SUL • Block log • User rights log • User rights • xtools)
Operator: Lmorillas (talk • contribs • logs)

Task/s: Search info, taxonomies and translate our images at


Function details: Search image names at wikidata and get info about them --Arasaacbot (talk) 12:28, 15 January 2018 (UTC)

@Arasaacbot, Lmorillas: Your GitHub repository doesn't have any actual code in it. It would be helpful if you could upload the source code. Also, can we please see a test run of 50-250 edits? Assuming you still plan on using this bot. — PinkAmpers&(Je vous invite à me parler) 23:27, 23 February 2018 (UTC)
@Arasaacbot, Lmorillas: Still interested? Matěj Suchánek (talk) 09:28, 3 August 2018 (UTC)
@Matěj Suchánek, PinkAmpersand: Sorry for the delay. I want to use Wikidata to improve the content of our image service. I asked a friend who uses Wikidata and he said that if we only need read access, no special permissions are needed. Is that right? Lmorillas (talk) 09:43, 7 August 2018 (UTC)
No, except some situations like big queries etc. Matěj Suchánek (talk) 11:24, 8 August 2018 (UTC)

taiwan democracy common bot

taiwan democracy common bot (talk • contribs • new items • SUL • Block log • User rights log • User rights • xtools)
Operator: (talk • contribs • logs)

Task/s: Input Taiwan politician data, it's a project from mySociety


Function details: follow this step to input politician data, mainly in P39 statement and related terms, constituency and political party. --~~~~

The operator can't be the bot itself. So who's going to operate the bot? Mbch331 (talk) 14:46, 18 February 2018 (UTC)
Operator will be (talk • contribs • logs), bot: taiwan democracy common bot.
This is (talk • contribs • logs), based on Wikidata:Requests for permissions/Bot/taiwan democracy common--Ymblanter (talk) 09:42, 25 February 2018 (UTC)
I would like to get some input from uninvolved users here before we can proceed.--Ymblanter (talk) 18:56, 1 March 2018 (UTC)
  • The bot might need a fix for date precision (9→7). It seems that everybody is born on January 1: Q19825688, Q8274933, Q8274088, Q8350110. As these items already had more precise dates, it might want to skip them.
    --- Jura 11:00, 12 March 2018 (UTC) Fixed, thanks.
@Jura1:, can we proceed here?--Ymblanter (talk) 21:06, 21 March 2018 (UTC)
I have a hard time trying to figure out what it's trying to do. Maybe some new test edits could help. Is the date precision for the start date in the qualifier of Q19825688 correct?
--- Jura 21:37, 21 March 2018 (UTC)


Newswirebot (talk • contribs • new items • SUL • Block log • User rights log • User rights • xtools)
Operator: Dhx1 (talk • contribs • logs)


  1. Create items for news articles that are published by a collection of popular/widespread newspapers around the world.


  • To be developed.

Function details:


  • New items created by this bot can be used in described by source (P1343) and other references within Wikidata.
  • New items created by this bot can be referred to in Wikinews articles.


  1. For each candidate news article, check whether a Wikidata item of the same title exists with a publication date (P577) +/- 1 day.
    1. If an existing Wikidata item is found, check whether publisher (P123) is a match as well.
    2. If publisher (P123) matches, ignore the candidate news article.
  2. For each candidate news article, check whether an existing Wikidata item has the same official website (P856) (full URL to the published news article).
    1. If official website (P856) matches, ignore the candidate news article.
  3. If no existing Wikidata item is found, create a new item.
  4. Add a label in English which is the article title.
  5. Add descriptions in multiple languages in the format of "news article published by PUBLISHER on DATE".
  6. Add statement instance of (P31) news article (Q5707594).
  7. Add statement language of work or name (P407) English (Q1860).
  8. Add statement publisher (P123).
  9. Add statement publication date (P577).
  10. Add statement official website (P856).
  11. Add statement author name string (P2093) which represents the byline (Q1425760). Note that this could be the name of a news agency or a combination of news agency and publisher if the writer is not identified.
  12. Add statement title (P1476) which represents the headline (Q1313396).
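The duplicate check in steps 1 and 2 could be sketched as below. This is only an illustration under stated assumptions: the bot's code had not been published at the time of the request, existing items are represented as plain dictionaries instead of being queried from Wikidata, and the field names and sample values are hypothetical.

```python
from datetime import date, timedelta


def is_duplicate(candidate, existing_items):
    """Return True if the candidate article already appears to have an item."""
    for item in existing_items:
        # Step 2: same full article URL (official website, P856).
        if item["url"] == candidate["url"]:
            return True
        # Step 1: same title, publication date (P577) within one day,
        # and the same publisher (P123).
        same_title = item["title"] == candidate["title"]
        close_date = abs(item["published"] - candidate["published"]) <= timedelta(days=1)
        if same_title and close_date and item["publisher"] == candidate["publisher"]:
            return True
    return False


# Hypothetical sample data for illustration only.
existing = [
    {
        "title": "Example headline",
        "published": date(2018, 2, 7),
        "publisher": "Example Times",
        "url": "https://news.example/a/1",
    }
]
candidate = {
    "title": "Example headline",
    "published": date(2018, 2, 8),
    "publisher": "Example Times",
    "url": "https://news.example/a/1-syndicated",
}
print(is_duplicate(candidate, existing))
```

Only candidates for which this check returns False would proceed to item creation in step 3.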

Example sources and copyright discussions:

--Dhx1 (talk) 13:00, 8 February 2018 (UTC)

Interesting initiative. How many articles do you plan to create per day? --Pasleim (talk) 08:44, 9 February 2018 (UTC)
I was thinking of programming the bot to regularly check Grafana and/or Special:DispatchStats or a similar statistics endpoint, raising or lowering the rate of edits up to a predefined limit. It appears that larger publishers may publish around 300 articles per day, so if the bot were developed to work with 10 sources, that is around 3000 new articles per day, or roughly one new article every 30 seconds. For the initial import, an edit rate of 1 article creation per second (what User:Research_Bot seems to use at the moment) would allow 86,400 articles to be processed per day, or approximately 30 days' worth of archives processed per day. At that rate, it might take 4-5 months to complete the initial import. Dhx1 (talk) 10:12, 9 February 2018 (UTC)
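The arithmetic behind these estimates can be checked with a few lines. The figures (300 articles per publisher per day, 10 sources, 1 creation per second) are the ones quoted in the comment; the variable and function names are only for illustration, and the archive length passed to `import_days` is a free parameter because the thread does not state one.

```python
ARTICLES_PER_SOURCE_PER_DAY = 300
SOURCES = 10
SECONDS_PER_DAY = 24 * 60 * 60

# Steady state: new articles per day and the average gap between them.
new_articles_per_day = ARTICLES_PER_SOURCE_PER_DAY * SOURCES
seconds_between_articles = SECONDS_PER_DAY / new_articles_per_day

# Initial import at one item creation per second.
import_capacity_per_day = SECONDS_PER_DAY * 1
archive_days_per_day = import_capacity_per_day / new_articles_per_day


def import_days(archive_years):
    """Days needed to import an archive of the given length at that rate."""
    return archive_years * 365 / archive_days_per_day


print(new_articles_per_day, seconds_between_articles, archive_days_per_day)
```

At about 28.8 archive days per day of running, a 4-5 month import would correspond to roughly ten to twelve years of archives.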
We probably need the code and test edits to continue this discussion.--Ymblanter (talk) 08:31, 25 February 2018 (UTC)
@Dhx1: What do you think about Zotero translators? Could they be somehow used in order to speed up the process?--Malore (talk) 16:09, 20 September 2018 (UTC)
@Malore: I have been using scrapy, which is trivial to use for crawling and extracting information. The trickier part at the moment is finding matching Wikidata items that already exist, and writing to Wikidata. Pywikibot doesn't seem to allow writing a large Wikidata item at once with many claims, qualifiers and references. The API allows it, however, and the WikidataIntegrator bot also allows it, albeit with little documentation to make it clear how it works. Zotero could be helpful if a large community forms around it with news website metadata scraping (for bibliographies). Dhx1 (talk) 11:53, 23 September 2018 (UTC)


KlosseBot (talk • contribs • new items • SUL • Block log • User rights log • User rights • xtools)
Operator: Walter Klosse (talk • contribs • logs)

Task/s: The bot makes mass edits with Widar.

Function details: This bot will mass-create items with QuickStatements, predominantly about comic book characters, using information from sites such as or Marvel Database.

--Walter Klosse (talk) 20:40, 17 November 2017 (UTC)

Please be more precise about which task you are going to perform with your bot. Permission should be requested task by task. See also Wikidata:Bots. Lymantria (talk) 18:45, 25 November 2017 (UTC)
I am confused. Edits your bot has made since this request do not fall within the scope you described above, but seem to focus on programming languages. Lymantria (talk) 13:03, 8 December 2017 (UTC)
My bad, originally I thought that this bot would only handle comic book characters, but now I make edits on more topics. --Walter Klosse (talk) 21:17, 15 December 2017 (UTC)
Please keep in mind that permission should be requested task by task. See also Wikidata:Bots. But if your tasks are "small", perhaps a bot flag is not needed. Lymantria (talk) 15:17, 17 December 2017 (UTC)
@Walter Klosse: Still interested? Matěj Suchánek (talk) 09:19, 3 August 2018 (UTC)

NIOSH bot

NIOSH bot (talk • contribs • new items • SUL • Block log • User rights log • User rights • xtools)
Operator: Harej (talk • contribs • logs)

Task/s: Synchronize Wikidata with the NIOSHTIC-2 research database.


Function details: NIOSHTIC-2 is a database of occupational safety and health research published by NIOSH and/or supported by NIOSH staff. As part of my work with NIOSH I have developed scripts to make sure NIOSHTIC has corresponding entries in Wikidata (but, where possible, it will not create duplicates of entries that already exist on Wikidata). This allows NIOSH's data to be part of a greater network of data, for instance by including data from other sources such as PubMed. Better indexing this data is part of a longer-term effort to make it easier for Wikipedia editors to discover these reliable resources. --Harej (talk) 05:59, 14 November 2017 (UTC)

Please make some test edits.--Ymblanter (talk) 11:51, 19 November 2017 (UTC)
@Harej: Still interested? Matěj Suchánek (talk) 09:19, 3 August 2018 (UTC)
@Matěj Suchánek: In principle yes; however, I'm currently in the process of reworking my scripts so that they will work for Wikidata at its current size. Harej (talk) 15:09, 3 August 2018 (UTC)

Ymblanter, Matěj Suchánek, I have made some test edits. Please let me know if you have any questions. Harej (talk) 17:17, 12 August 2018 (UTC)

I am fine with the test edits and can approve the bot in several days provided there have been no objections raised.--Ymblanter (talk) 21:22, 12 August 2018 (UTC)
@Harej: For the first two test edits I get a „We are sorry, the page you are looking for was not found.“ message. --Succu (talk)
Succu, I generally find that happens when an entry is new enough in the NIOSHTIC database that it has a listing in the search engine but not a corresponding static page. However if you search NIOSHTIC for the date range during which the article was published, the article will still show up in the search results. I would link to search results, but it's one of those search engines where the results expire. (Frustrating, I know.) Harej (talk) 05:21, 13 August 2018 (UTC)
Why not omit them until the changes are online? Sorry for the delayed answer, Harej. --Succu (talk) 19:06, 26 August 2018 (UTC)
Succu, I have no way of distinguishing between which ones are online and which ones aren't. They show up in the search engine results anyway, so I would consider them valid assigned numbers. Harej (talk) 19:42, 26 August 2018 (UTC)
Load the page and look for „We are sorry, the page you are looking for was not found.“ I think if this string is not present all is fine. --Succu (talk) 19:47, 26 August 2018 (UTC)
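The check suggested above could look roughly like this. It is a sketch, not part of NIOSH bot: the function names are hypothetical, the caller would have to supply the record's page URL (which is not given in this thread), and the marker string is the error message quoted in the discussion without its quotation marks.

```python
import urllib.error
import urllib.request

NOT_FOUND_MARKER = "We are sorry, the page you are looking for was not found."


def fetch_page(url, timeout=10):
    """Download a page as text; return None if the server answers with an HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError:
        return None


def is_record_online(page_html):
    """True if the page loaded and does not contain the 'not found' message."""
    return page_html is not None and NOT_FOUND_MARKER not in page_html


# Offline illustration with made-up page bodies.
print(is_record_online("<html><body>Some record title</body></html>"))
print(is_record_online("<html><body>" + NOT_FOUND_MARKER + "</body></html>"))
```

A bot could skip records for which the check fails and retry them on a later run, once the static page has appeared.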

Ymblanter, do you have any further questions or concerns regarding NIOSH bot? Harej (talk) 19:03, 26 August 2018 (UTC)

I am going to sleep now, I hope you will resolve the above issue by tomorrow, and then I will approve the bot.--Ymblanter (talk) 20:33, 26 August 2018 (UTC)


neonionbot (talk • contribs • new items • SUL • Block log • User rights log • User rights • xtools)
Operator: Jkatzwinkel (talk • contribs • logs)

Task/s: Map semantic annotations made with annotation software neonion to wikidata statements in order to submit either bibliographical evidence, additional predicates or new entities to wikidata. Annotation software neonion is used for collaborative semantic annotating of academic publications. If a text resource being annotated is an open access publication and linked to a wikidata item page holding bibliographical metadata about the corresponding open access publication, verifiable contributions can be made to wikidata by one of the following:

  1. For a semantic annotation, identify an equivalent wikidata statement and provide bibliographical reference for that statement, linking to the item page representing the publication in which the semantic annotation has been created.
  2. If a semantic annotation provides new information about an entity represented by an existing wikidata item page, create a new statement for that item page containing the predicate introduced by the semantic annotation. Attach bibliographic evidence to the new statement analogously to scenario #1.
  3. If a semantic annotation represents a fact about an entity not yet represented by a wikidata item page, create an item page and populate it with at least a label and a P31 statement in order to meet the requirements for scenario #2. Provide bibliographical evidence as in scenario #1.

Code: Implementation of this feature will be published on my neonion fork on github.

Function details: Prerequisite: Map model of neonion's controlled vocabulary to terminological knowledge extracted from wikidata. Analysis of wikidata instance/class relationships ensures that concepts of controlled vocabulary can be mapped to item pages representing wikidata classes.

Task 1: Identify item pages and possibly statements on wikidata that are equivalent to the information contained in semantic annotations made in neonion.

Task 2: Based on the results of task 1, determine if it is appropriate to create additional content on wikidata in form of new statements or new item pages. For the statements at hand, provide an additional reference representing bibliographical evidence referring to the wikidata item page representing the open access publication in which neonion created the semantic annotation.

What data will be added? Proposed scenario is meant to be tried first on articles published in scientific open-access journal Apparatus. --Jkatzwinkel (talk) 06:15, 19 October 2017 (UTC)

I find this proposal very hard to understand without seeing an example - can you run one or mock one (or several) up using the neonionbot account so we can see what it would likely do? ArthurPSmith (talk) 13:12, 19 October 2017 (UTC)


Handelsregister (talk • contribs • new items • SUL • Block log • User rights log • User rights • xtools)
Operator: SebastianHellmann (talk • contribs • logs)

Task/s: Crawl and then go to UT (Unternehmenstraeger) and add an entry for each German organisation with the basic info, especially registering court and assigned id by court into Wikidata.

Code: The code is a fork of (small changes only)

Function details:

Task 1, prerequisite for Task 2 Find all current organisations in Wikidata that are registered in Germany and find the correlating Handelsregister entry. Then add the data for the respective Wikidata items.

What data will be added? The Handelsregister collects information from all German courts, where all organisations in Germany are obliged to register. The courts pass the data to a private company that runs the Handelsregister, which makes part of the information public (i.e. UT - Unternehmenstraegerdaten, core data) and sells the other part. Each organisation can be uniquely identified by the registering court and the number assigned by this court (the number alone is not enough, as two courts might assign the same number). Here is an example of the data:

  • Saxony District court Leipzig HRB 32853 – A&A Dienstleistungsgesellschaft mbH
  • Legal status: Gesellschaft mit beschränkter Haftung
  • Capital: 25.000,00 EUR
  • Date of entry: 29/08/2016 (When entering date of entry, wrong data input can occur due to system failures!)
  • Date of removal: -
  • Balance sheet available: -
  • Address (subject to correction): A&A Dienstleistungsgesellschaft mbH, Prager Straße 38-40, 04317 Leipzig

Most items are stable, i.e. each org is registered when it is founded and is assigned a number by the court: Saxony District court Leipzig HRB 32853. After that, only the address and the status can change. For Wikidata, it is no problem to keep companies that no longer exist, as they should be conserved for historical purposes.

Maintenance should be simple: Once a Wikidata item contains the correct court and the number, the entry can be matched 100% to the entry in Handelsregister. This way Handelsregister can be queried once or twice a year to update the info in Wikidata.
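
The matching idea above can be sketched as follows. This is only an illustration, assuming register strings shaped like the example ("<court> HRB <number>"); the names RegisterKey and parse_register_key and the pattern itself are hypothetical and not part of the proposed bot.

```python
import re
from typing import NamedTuple, Optional


class RegisterKey(NamedTuple):
    """Composite identifier: the number alone is not unique across courts."""
    court: str          # registering court, e.g. "Saxony District court Leipzig"
    register_type: str  # register division, e.g. "HRB"
    number: str         # number assigned by that court


# Assumption: only the HRA/HRB divisions are handled in this sketch.
_KEY_RE = re.compile(r"^(?P<court>.+?)\s+(?P<type>HR[AB])\s+(?P<number>\d+)$")


def parse_register_key(text: str) -> Optional[RegisterKey]:
    """Split a register string into (court, type, number), or None if it does not fit."""
    match = _KEY_RE.match(text.strip())
    if match is None:
        return None
    return RegisterKey(match.group("court"), match.group("type"), match.group("number"))


key = parse_register_key("Saxony District court Leipzig HRB 32853")
print(key)
```

Two entries with the same number but different courts yield different keys, which is the property the update run would rely on.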

Question 1: bot or other tool. How is the data added? I am keeping the bot request, but I will look at Mix and Match first. Maybe this tool is better suited for task 1.

Question 2: modeling. Which properties should be used in Wikidata? I am particularly looking for the property for the court as registering organisation, i.e. the one that has the authority to define the identity of an org, and then also the number (HRB 32853). The types, i.e. legal status, can be matched to existing Wikidata entries. Most exist in the German Wikipedia. Any help for the other properties is appreciated.

Question 3: legal. I still need to read up on the rights situation for importing crawled data. Here is a hint given on the mailing list: You'd need to check whether in Germany it applies to official acts and registers too...

Task 2: Add all missing identifiers for the remaining orgs in Handelsregister. Task 2 can be rediscussed and decided once Task 1 is sufficiently complete.

It should meet notability criteria 2:

  • 2. It refers to an instance of a clearly identifiable conceptual or material entity. The entity must be notable, in the sense that it can be described using serious and publicly available references. If there is no item about you yet, you are probably not notable.

The reference is the official German business registry, which is serious and public. Orgs are also per definition clearly identifiable legal entities.

--SebastianHellmann (talk) 07:39, 16 October 2017 (UTC)

Could you make a few example entries to illustrate what the items you want to create will look like? What strategy will you use to avoid creating duplicate items? ChristianKl (talk) 12:38, 16 October 2017 (UTC)
I think this is a good idea, but I agree there needs to be a clear approach to avoiding creating duplicates - we have hundreds of thousands of organizations in wikidata now, many of them businesses, many from Germany, so there certainly should be some overlap. Also I'd like to hear how the proposer plans to keep this information up to date in future. ArthurPSmith (talk) 15:13, 16 October 2017 (UTC)
There was a discussion on the mailing list. It would be easier to complete the info for existing entries in Wikidata at first. I will check mix and match for this or other methods. Once this space is clean, we can rediscuss creating new identifiers. SebastianHellmann (talk) 16:01, 16 October 2017 (UTC)
Is there an existing ID that you plan to use for authority control? Otherwise, do we need a new property? ChristianKl (talk) 20:40, 16 October 2017 (UTC)
I think that the ID needs to be combined, i.e. registering court and register number. That might be two properties. SebastianHellmann (talk) 16:05, 29 November 2017 (UTC)
  • Given that this data is fairly frequently updated, how is it planned to maintain it?
    --- Jura 16:38, 16 October 2017 (UTC)
  • The frequency of updates is indeed large: A search for deletion announcements alone in the limited timeframe of 1.9.-15.10.17 finds 6682 deletion announcements (which legally is the most serious change and makes up approx. 10% of all announcements). So within one year, more than 50,000 companies are deleted - which for sure should be reflected in the corresponding Wikidata entries. Jneubert (talk) 15:44, 17 October 2017 (UTC)
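
The yearly figure can be checked with a quick extrapolation. This is a rough sketch that assumes the deletion rate seen in the 45-day window holds for the whole year; the function name is hypothetical.

```python
from datetime import date


def annualised(count: int, start: date, end: date) -> float:
    """Scale a count observed between start and end (inclusive) to 365 days."""
    days = (end - start).days + 1  # 1 Sep to 15 Oct 2017 inclusive is 45 days
    return count / days * 365


estimate = annualised(6682, date(2017, 9, 1), date(2017, 10, 15))
print(round(estimate))  # about 54,000 deletions per year
```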
Hi all, I updated the bot description, trying to answer all questions from the mailing list and here. I still have three questions, which I am investigating. Help and pointers highly appreciated. SebastianHellmann (talk) 23:36, 16 October 2017 (UTC)
  • Given that German is the default language in Germany I would prefer the entry to be "Sachsen Amtsgericht Leipzig HRB 32853" instead of "Saxony District court Leipzig HRB 32853". Afterwards we can store that as an external ID and make a new property for that (which would need a property proposal). ChristianKl (talk) 12:33, 17 October 2017 (UTC)
Thanks for the updated details here. It sounds like a new identifier property may be needed (unless one of the existing ones like Legal Entity Identifier (P1278) suffices, but I suspect most of the organizations in this list do not have LEI's (yet?)). Ideally an identifier property has some way to turn the identifiers into a URL link with further information on that particular identified entity, that de-referenceability makes it easy to verify - see "formatter URL" examples on some existing identifier properties. Does such a thing exist for the Handelsregister? ArthurPSmith (talk) 14:58, 17 October 2017 (UTC)

Kopiersperre Jklamo ArthurPSmith S.K. Givegivetake fnielsen rjlabs ChristianKl Vladimir Alexiev User:Pintoch Parikan User:Cardinha00 User:zuphilip MB-one User:Simonmarch User:Jneubert Mathieudu68 User:Kippelboy User:Datawiki30 User:PKM User:RollTide882071 Kristbaum Andber08 Sidpark SilentSpike Susanna Ånäs (Susannaanas)

  Notified participants of WikiProject Companies for input.

@SebastianHellmann: for task 1, you might also be interested in OpenRefine (make sure you use the German reconciliation interface to get better results). See for details of its reconciliation features. I suspect your dataset might be a bit big though: I think it would be worth trying only on a subset (for instance, filter out those with a low capital). − Pintoch (talk) 14:52, 20 October 2017 (UTC)

Concerning Task 2, I'm a bit worried about the companies' notability (or lack thereof), since the Handelsregister includes any and all companies. Not just the big ones where there's a good chance that Wikipedia articles, other sources, external IDs, etc. exist, but also tiny companies and even one-person companies, like someone selling stuff on Ebay or some guy selling Christmas trees in his village. So it would be very hard to find any data on these companies outside the Handelsregister and the phonebook. --Kam Solusar (talk) 05:35, 21 October 2017 (UTC)

Agreed. Do we really need to be a complete copy of the Handelsregister? What for? How about concentrating on a meaningful subset instead that addresses a clear usecase? --LydiaPintscher (talk) 10:35, 21 October 2017 (UTC)
That of course is true. A strict reading of Wikidata:Notability could be seen as that at least two reliable sources are required. But then, that could be the phone book. Do we have to make those criteria more strict? That would require a RfC. Lymantria (talk) 07:58, 1 November 2017 (UTC)
I would at least try an RfC, but I am not immediately sure what to propose.--Ymblanter (talk) 08:05, 1 November 2017 (UTC)
If there's an RfC I would say that it should say that for data-imports of >1000 items the decision whether or not we import the data should be done via a request for bot permissions. ChristianKl (talk) 12:35, 4 November 2017 (UTC)
@SebastianHellmann: is well-intended, but I agree not all companies are notable. Even worse than 1-man shops are inactive companies that nobody has bothered to close yet. Just "comes from reputable source" is not enough: e.g. OpenStreetMap is reputable, and it would be ok to import all power stations (e.g. see Enipedia) but imho not ok to import all recyclable garbage cans. We got 950k BG companies at but we are hesitant to dump them on Wikidata. Unfortunately official trade registers usually lack measures of size or importance...
It's true the Project Companies has not gelled yet and there's no clear Community of Use for this data. On the other hand, if we don't start somewhere and experiment, we may never get big quantities of company data. So I'd agree to this German data dump by way of experiment --Vladimir Alexiev (talk) 15:46, 19 November 2017 (UTC)

  Notified participants of WikiProject Companies

  Comment As best I know, Project Companies has yet to gel up a workable (for the immediate term) notability standard, so the area remains fuzzy. Here is my current thinking [[23]]. I very much like the automation of updates described above. Hopefully the fetching scripts for Germany can be generalized to work in most developed countries that publish structured data on public companies. I would love to find WikiData consensus on notability vs. its IT capacity and stomach for volumes of basically table data. Rjlabs (talk) 16:47, 3 November 2017 (UTC)

  • @Rjlabs: That hope is not founded because each jurisdiction does its own thing. OpenCorporates has a bunch of web crawling scripts (some of them donated) that they consider a significant IP. And as @SebastianHellmann: wrote their data is sorta open but not really. --Vladimir Alexiev (talk) 15:46, 19 November 2017 (UTC)
  • I   Support importing the data. Having the data makes it easier to enter the employer when we create items for new people. Companies also engage into other actions that leave marks in databases such as registering patents or trademarks and it makes it easier to import such data when we already have items for the companies. The ability to run queries about the companies that are located in a given area is useful. ChristianKl (talk) 17:20, 3 November 2017 (UTC)
    • @ChristianKl: at least half of the 200M or so companies world-wide will never have notable employees nor patents, so "let's import them just in case" is not a good policy --Vladimir Alexiev (talk) 15:46, 19 November 2017 (UTC)
  • When it comes to these mass imports I would only want to mass import datasets about companies from authoritative sources. If we talk about a country like Uganda, I think it would be great to have an item for all companies that truly exist in Uganda. People in Uganda care about the companies that exist in their country and their government might not have the capability to host that data in a user-friendly way. An African app developer could profit from the existence of a unique identifier that's the same for multiple African countries.
When it comes to the concern about data not being up-to-date, there were multiple cases where I would have really liked data about 19th century companies while doing research in Wikidata. Having data that's kept up-to-date is great, but having old data is also great. ChristianKl () 20:11, 13 December 2017 (UTC)
  • @Rjlabs: We did go back and forth with a lot of ideas on how to set some sort of criteria for company notability. I think any public company with a stock market listing should be considered notable, as there's a lot of public data available on those. For private companies we talked about some kind of size cutoff, but I suppose the existence of 2 or more independent reference sources with information about the company might be enough? ArthurPSmith (talk) 18:01, 3 November 2017 (UTC)
  • @ArthurPSmith:@Denny:@LydiaPintscher: Arthur, let's make it any public company that trades on a recognized stock exchange, anywhere worldwide, with a continuous bid and ask quote, that actually trades at least once per week is automatically considered "notable" for WikiData inclusion. This is by virtue that real people wrote real checks to buy shares and there is sufficient continuing trading interest in the stock to make it trade at least once per week, and some exchange somewhere endows that firm to be listed on its exchange. We should also note that passing this hurdle means that SOME data on that firm is automatically allowable on WikiData, provided the data is regularly updated. Rjlabs (talk) 19:35, 3 November 2017 (UTC)
    • @Rjlabs, Denny, LydiaPintscher: Public Companies are a no-brainer because there's only 60k in the world (there are about 2.6k exchanges); compare to about 200M companies world-wide. --Vladimir Alexiev (talk) 15:46, 19 November 2017 (UTC)
  • Some data means (for right now) information like LEI, name, address, phone, industry code(s), and a brief text description of what they do, plus about 10 high-level fields that cover the most frequently needed company data, such as: sales, employees, assets, principal exchange(s) down to where at least 20% of the volume is traded, unique symbol on that exchange, CEO, URL to the investor relations section of the website where detailed financial statements may be found, and Central Index Key (or equivalent) with a link to regulatory filings / structured data in the primary country where it's regulated. For now that is all that should be "automatically allowable". No detailed financial statements, line by line, going back 10-20 years, with adjustments for stock splits, etc. No bid/offer/last trade time series. Consensus on further detail has to wait for further gelling up. I ping Lydia and Denny here to be sure they are good with this potential volume of linked data. (I think it would be great, a good start and limited. I especially like it if it MANDATES LEI, if one is available.) Moving down from here (after 100% of public companies that are alive enough to actually trade) there is of course much more. However, it's a very murky area. >=2 independent reference sources with information about the company might be too broad, causing WikiData capacity issues, or it may be too burdensome if someone has a structured data source that is much more reliable than WikiData to feed in, but lacks that "second source". Even if there was one absolutely assured good quality source, and WikiData capacity was not an issue, I'd like to see a "sustainability" requirement up front. Load no private company data where it isn't AUTOMATICALLY updated or expired out. Again, it would be great to have further Denny/Lydia input here on any capacity concern. Rjlabs (talk) 19:35, 3 November 2017 (UTC)
    • "A modicum of data" as you describe above is a good criterion for any company. --Vladimir Alexiev (talk)
    • At WikidataCon there was a question from the audience about whether Wikidata would be okay with importing the 400 million entries about items in museums that are currently managed by various museums. User:LydiaPintscher answered by saying that her main concerns aren't technical but whether our community does well with handling a huge influx of items. Importing data like the Handelsregister will mean that there will be a lot of items that won't be touched by humans but I don't think that's a major concern for our community. Having more data means more work for our community but it also means that new people get interested in interacting with Wikidata. When we make decisions like this, technical capabilities however matter. I think it would be great if a member of the development team would write a longer blog post that explains the technical capabilities, so that we can better factor them into our policy decisions. ChristianKl (talk) 12:35, 4 November 2017 (UTC)
I agree with Lydia. The issue is hardly the scalability of the software - the software is designed in such a way that there *should* not be problems with 400M new items. The question is do we have a story as a community to ensure that these items don't just turn into dead weight. Do we ensure that items in this set are reconciled with existing items if they should be? That we can deal with attacks on that dataset in some way, with targeted vandalism? Whether the software can scale, I am rather convinced. Whether the community can scale, I think we need to learn that.
Also, for the software, I would suggest not to grow 10x at once, but rather to increase the total size of the database with a bit more measure, and never to more than double it in one go. But this is just, basically, for stress-testing it, and to discover, if possible, early unexpected issues. But the architecture itself should accommodate such sizes without much ado (again - "should" - if we really go for 10x, I expect at least one unexpected bug to show up). --Denny (talk) 23:25, 5 November 2017 (UTC)
Speaking of the community being able to handle dead weight, it seems we mostly lack the tools to do so. Currently we are somewhat flooded by items from cebwiki and despite efforts by individual users to deal with one or the other problem, we still haven't tackled them systematically and this has led to countless items with unclear scope complicating every other import.
--- Jura 07:00, 6 November 2017 (UTC)
I don't think we should just add 400M new items in one go either. I don't think that the amount of vandalism that Wikidata faces scales directly with the amount of items that we host: if we double the amount of items, we don't double the amount of vandalism.
As far as the cebwiki items go, the problem isn't just that there are many items. The problem is that there's unclear scope for a lot of the items. For me that means that when we allow massive data imports we have to make sure that the imported data is of high quality, where the scope of every item is clear. This means that having a bot approval process for such data imports is important and suggests to me that we should also get clear about the necessity of having a bot approval for creating a lot of items via QuickStatements.
Currently, we are importing a lot of items via WikiCite and it seems to me that process is working without significant issues.
I agree that scaling the community should be a higher priority than scaling the number of items. One implication of that is that it makes sense to have higher standards for mass imports via bots than for items added by individuals (a newbie is more likely to become involved in our community when we don't greet them by deleting the items they created).
Another implication is that the metric we celebrate shouldn't be focused on the number of items or statements/item but the number of active editors. ChristianKl () 09:58, 20 November 2017 (UTC)

Now what?

Lots of good discussion above. Would anyone care to summarize, and how do we move to a decision? --Vladimir Alexiev (talk) 15:10, 5 December 2017 (UTC)

  • Some seem to consider it too granular. Maybe a test could be done with a subset. If no other criteria can be determined, maybe a start could be made with companies with a capital > EUR 100 million.
    --- Jura 20:21, 13 December 2017 (UTC)

Jntent's Bot 1

Jntent's Bot (talk • contribs • new items • SUL • Block log • User rights log • User rights • xtools)
Operator: Jntent (talk • contribs • logs)


The task is to add assertions about airports from template pages.


The code is based on pywikibot's  under scripts in

Function details:

I added some constraints for literal values with regular expressions to parse "Infobox Airport" and similar ones in other languages. See the

I hope to scrape the airport templates from a few languages. Since the "Infobox Airport" template contains links to pages about airport codes,

{{Infobox airport
| name         = Denver International Airport
| image        = Denver International Airport Logo.svg
| image-width  = 250
| image2       = DIA Airport Roof.jpg
| image2-width = 250
| IATA         = DEN
| ICAO         = KDEN
| FAA          = DEN
| WMO          = 72565
| type         = Public
| owner        = City & County of Denver Department of Aviation
| operator     = City & County of Denver Department of Aviation
| city-served  = [[Denver]], the [[Front Range Urban Corridor]], Eastern Colorado, Southeastern Wyoming, and the [[Nebraska Panhandle]]
| location     = Northeastern [[Denver]], [[Colorado]], U.S.
| hub          =

I will use links to pages about airport codes to find airports. One example is:

Template element   Property        Constraining regex (from properties)
IATA               Property:P238   [A-Z]{3}
ICAO               Property:P239   ([A-Z]{2}|[CKY][A-Z0-9])[A-Z0-9]{2}
FAA                Property:P240   [A-Z0-9]{3,4}
coordinates        Property:P625   6 numbers and 2 cardinal directions surrounded by "|" from the coord template
city-served        Property:P931   The first valid link, standard behavior
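
As an illustration of how the regex constraints in the table could be applied to parsed infobox parameters, here is a minimal sketch. It is not the bot's actual code: the function name extract_codes and the plain-dict input are assumptions, and only the three code properties are covered.

```python
import re

# Patterns copied from the table above; values must match in full.
CODE_PATTERNS = {
    "IATA": ("P238", re.compile(r"[A-Z]{3}")),
    "ICAO": ("P239", re.compile(r"([A-Z]{2}|[CKY][A-Z0-9])[A-Z0-9]{2}")),
    "FAA": ("P240", re.compile(r"[A-Z0-9]{3,4}")),
}


def extract_codes(params: dict) -> dict:
    """Return {property id: value} for infobox parameters that satisfy their regex."""
    result = {}
    for name, (prop, pattern) in CODE_PATTERNS.items():
        value = params.get(name, "").strip()
        if value and pattern.fullmatch(value):
            result[prop] = value
    return result


denver = {"IATA": "DEN", "ICAO": "KDEN", "FAA": "DEN", "WMO": "72565"}
print(extract_codes(denver))
```

Because the match must cover the whole value, a parameter holding two codes at once (for example "FDKS/FDKB") is skipped rather than truncated to one of them.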

 – The preceding unsigned comment was added by Jntent (talk • contribs).

  •   Comment I think there were some problems with these infoboxes in one language. Not sure which one it was. Maybe Innocent bystander recalls (I think he mentioned it once).
    --- Jura 11:28, 8 July 2017 (UTC)
    Well, I am not sure if I (today) remember any such problems. But it could be worth mentioning that these codes can also be found in sv:Mall:Geobox and ceb:Plantilya:Geobox that are used in the Lsjbot articles. These templates are not specially adapted to airports, but Lsj used the same template also for this group of articles. The Swedish template has special parameters for this ("IATA-kod" and "ICAO-kod") while the cebwiki articles use the parameters "free" and "free_type". (Could be worth checking free1, free2 too.) See ceb:Coyoles (tugpahanan) as an example. -- Innocent bystander (talk) 15:17, 8 July 2017 (UTC)
  • @Jntent: in this edit I see the bot replaced FDKS with FDKB, while in the en.wp infobox and lead section there are two values for the ICAO code: FDKS/FDKB. I would suggest not changing any existing value, or such changes should probably be checked manually. The safest way to act here would be to just add missing values. XXN, 14:07, 17 July 2017 (UTC)
  • @Jntent: Still interested? Matěj Suchánek (talk) 09:21, 3 August 2018 (UTC)


WikiProjectFranceBot (talk • contribs • new items • SUL • Block log • User rights log • User rights • xtools)
Operator: Alphos (talk • contribs • logs)

Task/s: Replace all located in the administrative territorial entity (P131) statements pointing from communes of France to cantons of France by territory overlaps (P3179) statements pointing from the same communes to the same cantons, including qualifiers (there are currently only date qualifiers), and adding a P794 qualifier on each new statement to indicate the subclass of canton.

Code: Partially available (for the first step) on GitHub

Function details: As has been the plan of WikiProject France since we proposed properties to better reflect the relationship between communes and cantons of France, we're now getting to actually push all the statements corresponding to these relationships from located in the administrative territorial entity (P131) to territory overlaps (P3179), and add the exact kind of P3179 this represents as qualifiers to said statements, without removing the original statements at first. Roughly 80 000 edits are to be expected.

At a later date, after checking everything went fine on the first pass, we plan on removing the (faulty) P131 statements between communes and cantons entirely, which will also be done by this bot.

Ash Crow
Thierry Caro
Nomen ad hoc
Marianne Casamance
Le Passant
  Notified participants of WikiProject France

--Alphos (talk) 20:00, 8 May 2017 (UTC)

@Alphos: Could you provide an example please? Thanks. — Ayack (talk) 09:05, 9 May 2017 (UTC)
Of course.
Nielles-lès-Bléquin (Q1000003) located in the administrative territorial entity (P131) canton of Lumbres (Q1726007)
would be replaced by:
Nielles-lès-Bléquin (Q1000003) territory overlaps (P3179) canton of Lumbres (Q1726007) (P794 (P794) canton of France (Q18524218))
Sainte-Croix (Q1002122) located in the administrative territorial entity (P131) canton of Montluel (Q1726339) (end time (P582) 2015-03-21)
would be replaced by:
Sainte-Croix (Q1002122) territory overlaps (P3179) canton of Montluel (Q1726339) (end time (P582) 2015-03-21 ; P794 (P794) canton of France (until 2015) (Q184188))
Other "examples" (in fact the whole list) can be found here:
The following query uses these properties: subclass of (P279), instance of (P31), located in the administrative territorial entity (P131)
    SELECT DISTINCT ?commune ?canton ?qualProp ?time ?precision ?timezone ?calendar WHERE {
      # Communes of France (Q484170), including instances of its subclasses
      ?commune p:P31/ps:P31/wdt:P279* wd:Q484170 .
      # P131 statements pointing from the commune to a canton
      ?commune p:P131 ?cantonStmt .
      ?cantonStmt ps:P131 ?canton .
      ?canton wdt:P31 ?cantonType .
      VALUES ?cantonType { wd:Q18524218 wd:Q184188 } .
      # Date qualifiers on the statement, with their full time value
      OPTIONAL {
        ?cantonStmt ?qualifier ?qualVal .
        ?qualProp wikibase:qualifierValue ?qualifier .
        ?qualVal wikibase:timePrecision ?precision ;
                 wikibase:timeValue ?time ;
                 wikibase:timeTimezone ?timezone ;
                 wikibase:timeCalendarModel ?calendar .
      }
    }
    ORDER BY ASC(?commune) ASC(?canton)
(which is what the bot works on)
Alphos (talk) 09:44, 9 May 2017 (UTC)
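
The per-statement transformation described in the examples above could look roughly like this. It is a sketch using plain dicts rather than the Wikibase data model, and the function name overlap_statement is hypothetical; it assumes the canton's type (Q18524218 or Q184188) is known for each row.

```python
from typing import Optional


def overlap_statement(commune: str, canton: str, canton_type: str,
                      date_qualifiers: Optional[dict] = None) -> dict:
    """Build the P3179 statement that mirrors an existing P131 statement."""
    qualifiers = dict(date_qualifiers or {})  # copy existing date qualifiers unchanged
    qualifiers["P794"] = canton_type          # add the kind of canton as a qualifier
    return {
        "subject": commune,
        "property": "P3179",  # territory overlaps
        "value": canton,
        "qualifiers": qualifiers,
    }


# Second example from above: Sainte-Croix and the canton of Montluel
stmt = overlap_statement("Q1002122", "Q1726339", "Q184188", {"P582": "2015-03-21"})
print(stmt)
```

The original P131 statement is left alone in this step, matching the plan to remove it only in a later pass.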
  Support The query seems good to me. Can you run a sample batch? -Ash Crow (talk) 18:26, 14 May 2017 (UTC)
The query is undeniably good, but I noticed an issue with edge cases on cantons with double status. I am working on it and will run a small batch (LIMIT 20 or maybe a small French département), probably later this week. Alphos (talk) 00:05, 16 May 2017 (UTC)
  Support Ayack (talk) 09:02, 16 May 2017 (UTC)
Please, let the bot run a couple of test edits. Besides, please, create the user page of the bot account (e.g. {{bot|Alphos}}). Lymantria (talk) 20:40, 25 June 2017 (UTC)
@Alphos: Any progress to be expected? Lymantria (talk) 13:51, 31 May 2018 (UTC)


Jefft0Bot (talk • contribs • new items • SUL • Block log • User rights log • User rights • xtools)
Operator: Jefft0 (talk • contribs • logs)

Task/s: Add references to external ontologies


Function details: Add equivalent class (P1709) for an external ontology when that ontology already defines mappings to Wikipedia or Wikidata.
For example, Umbel version 1.50 has mappings to Wikipedia here:
such as
<> umbel:isRelatedTo <āori_language> .
and that Wikipedia page links to Wikidata item Māori (Q36451) . So this item should have equivalent class (P1709) to with a reference URL (P854) to the file above. --Jefft0Bot (talk) 15:15, 17 April 2017 (UTC)

Please make several test edits.--Ymblanter (talk) 19:48, 28 July 2017 (UTC)
@Jefft0: Still interested? Matěj Suchánek (talk) 09:18, 3 August 2018 (UTC)

MexBot 2

MexBot (talk • contribs • new items • SUL • Block log • User rights log • User rights • xtools)
Operator: MarcAbonce (talk • contribs • logs)

Task/s: Add official population data for Mexican municipalities.


Function details:
The script finds all Mexican municipalities with an INEGI municipality ID and gets all the official population data available from INEGI's (Mexican public institute that does the census) API.
It will either add or update this data, with INEGI as the source.
It will also add census as the method for years ending in 0, when the census is taken.
MarcAbonce (talk) 21:45, 8 June 2017 (UTC)
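
A minimal sketch of the census-year rule described above, assuming population figures arrive as (year, value) pairs. The dict layout and the function names are hypothetical and do not reflect the script's real data structures or the INEGI API.

```python
def is_census_year(year: int) -> bool:
    """Per the description above, census figures are those for years ending in 0."""
    return year % 10 == 0


def population_claim(year: int, population: int) -> dict:
    """Describe one population (P1082) value sourced to INEGI."""
    claim = {
        "property": "P1082",
        "value": population,
        "point_in_time": year,
        "source": "INEGI",
    }
    if is_census_year(year):
        claim["method"] = "census"  # only set when the figure comes from a census
    return claim


print(population_claim(2010, 1000))
print(population_claim(2015, 1100))
```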

  Support --PokestarFan • Drink some tea and talk with me • Stalk my edits • I'm not shouting, I just like this font! 23:16, 8 June 2017 (UTC)
  Comment: Under which license does INEGI publish population data? XXN, 14:41, 9 June 2017 (UTC)
It is not stated explicitly, but it is similar to CC BY; see point f in section "Del libre uso de la información del INEGI" of Términos de uso. I don't think it is compatible. --ValterVB (talk) 17:35, 9 June 2017 (UTC)
Indeed, it only requires attribution, which is precisely what my script intends to add. Why would it be incompatible? Most of this data has already been manually added by people and apparently a Wikipedia scraping script too, but it's mostly unsourced. --MarcAbonce (talk)
Here we use CC0; if data here needs attribution, the data is incompatible with the license. --ValterVB (talk) 05:47, 11 June 2017 (UTC)
Can census data even be licensed, though? As far as I know, facts cannot be licensed anywhere. If this is the case, this license would only be enforceable with the statistical data they generate (which I'm not using) but it wouldn't be enforceable for a simple, "natural" fact such as a total population.
Also, as I mentioned, this data is already allowed in practice. Wikipedia importing bots have added census data into Wikidata by claiming Wikipedia as the source (which is also CC0 incompatible, by the way), but this data is not generated by Wikipedia, but rather taken from INEGI and imported without source.
So, unless you actually plan to delete all the unsourced and Wikipedia sourced Mexican population data from this site, the most reasonable thing to do would be to treat this data the way it has been treated so far, for the sake of consistency.
--MarcAbonce (talk)
  Support Mexico is outside of the EU and thus there are no sui generis concerns. Population data itself is about facts that by their nature aren't protected by copyright. ChristianKl (talk) 09:31, 25 June 2017 (UTC)
The license does not depend on whether Mexico is in or out of the EU. Wikidata uses CC0, while INEGI explicitly asks that one "Must give credit for the INEGI as an author"; for me they aren't compatible. --ValterVB (talk) 14:32, 25 June 2017 (UTC)


ZacheBot (talk • contribs • new items • SUL • Block log • User rights log • User rights • xtools)
Operator: Zache (talk • contribs • logs)

Task/s: Import data from pre-created CSV lists.

Code: based on Pywikibot (Q15169668), sample import scripts [24]

Function details:

--Zache (talk) 23:29, 4 March 2017 (UTC)

@Zache:, could you pls make a couple of test edits, I do not see any lakes in the contribution of the bot.--Ymblanter (talk) 21:20, 14 March 2017 (UTC)
@Zache: Are you still planning to do this task? If so, please provide a few test edits. --Pasleim (talk) 08:13, 11 July 2017 (UTC)
Hi, I did the vaalidatahack without bot permissions, so that one is done already. The lake thing is an ongoing project and is currently done using QuickStatements for single lakes; CC0 licence screening for larger imports is still the same. Most likely there will also be WLM-related data imports by me this summer, but I am not sure how big (most likely under 2000 items, some of which are updates for existing items and some new). User Susannaanas started this and I am continuing by filling in the details for the WLM targets. Most likely this WLM stuff will be done using pywikibot instead of QuickStatements because I can do consistency checks with the code. --Zache (talk) 11:12, 11 July 2017 (UTC)


YULbot (talk • contribs • new items • SUL • Block log • User rights log • User rights • xtools)
Operator: YULdigitalpreservation (talk • contribs • logs)


  • YULbot has the task of creating new items for pieces of software that do not yet have items in Wikidata.
  • YULbot will also make statements about those newly-created software items.

Code: I haven't written this bot yet.

Function details:

This bot will set the English language label for these items and create statements using publisher (P123), ISBN-13 (P212), ISBN-10 (P957), place of publication (P291), publication date (P577). --YULdigitalpreservation (talk) 18:04, 21 February 2017 (UTC)

It would be good to run a test with a few examples so we can see what you're planning! ArthurPSmith (talk) 20:46, 22 February 2017 (UTC)
Interesting. Where does the data come from? Emijrp (talk) 12:04, 25 February 2017 (UTC)
The data is coming from the pieces of software themselves. These are pieces of software that are in the Yale Library collection. We could also supplement with data from (talk) 13:07, 28 February 2017 (UTC)
Please let us know when the bot is ready for approval.--Ymblanter (talk) 21:12, 14 March 2017 (UTC)


YBot (talk • contribs • new items • SUL • Block log • User rights log • User rights • xtools)
Operator: Superyetkin (talk • contribs • logs)

Task/s: import data from Turkish Wikipedia

Code: The bot, currently active on trwiki, uses the Wikibot framework.

Function details: The code imports data (properties and identifiers) from trwiki, aiming to ease the path to Wikidata Phase 3 (to have items that store the data served on infoboxes) --Superyetkin (talk) 16:42, 12 January 2017 (UTC)

It would be good if you could check for constraint violations instead of just blindly copying data from trwiki. These violations are probably all caused by the bot. --Pasleim (talk) 19:26, 15 January 2017 (UTC)
Yes, I am still interested in this. --Superyetkin (talk) 12:20, 4 March 2018 (UTC)
@Superyetkin: If that is the case, can you take away concerns as indicated by Pasleim, by showing how you'll avoid the constraint violations? Lymantria (talk) 13:53, 31 May 2018 (UTC)
I think I can check for constraint violations using the related API method --Superyetkin (talk) 17:55, 1 June 2018 (UTC)
@Pasleim: Would that be sufficient? Lymantria (talk) 09:10, 3 June 2018 (UTC)
That API method works only for statements that have already been added to Wikidata. It would be good if a consistency check could be made prior to adding a statement. For example, the unique value constraint of YerelNet village ID (P2123) can be checked by downloading all current values [25], importing them into an array, and then, prior to saving a statement, having the bot check whether the value is already in the array. A format constraint can be implemented in PHP with preg_match(). Item constraints don't need to be checked because they only indicate missing data, not wrong data. --Pasleim (talk) 17:52, 3 June 2018 (UTC)
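
The pre-save check described above could look roughly like the following. This is only a minimal sketch in Python rather than PHP, and it is not the bot's actual code: the class `PreSaveChecker`, its method names, the sample values and the pattern `\d+` are all hypothetical placeholders. In practice the existing values would come from the downloaded list of current P2123 values, and the pattern from the property's real format constraint.

```python
import re


class PreSaveChecker:
    """Hypothetical helper: validates a value before the bot saves a statement."""

    def __init__(self, existing_values, format_pattern):
        # Values already present on Wikidata (assumed to be downloaded beforehand).
        self.seen = set(existing_values)
        # Placeholder regex standing in for the property's format constraint.
        self.pattern = re.compile(format_pattern)

    def check(self, value):
        """Return a list of problems; an empty list means the value looks safe."""
        problems = []
        if not self.pattern.fullmatch(value):
            problems.append("format")
        if value in self.seen:
            problems.append("duplicate")
        return problems

    def accept(self, value):
        """Record the value as used and return True only if it passes all checks."""
        if self.check(value):
            return False
        self.seen.add(value)
        return True


# Sample data is invented for illustration only.
checker = PreSaveChecker(existing_values=["12345", "67890"], format_pattern=r"\d+")
for candidate in ["12345", "abc", "11111", "11111"]:
    if checker.accept(candidate):
        print(f"{candidate}: would be saved")
    else:
        print(f"{candidate}: skipped")
```

Recording each accepted value in the same set means duplicates within a single import run are caught as well, not only clashes with values that were already on Wikidata when the list was downloaded.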