Wikidata:Requests for permissions/Bot/ProteinBoxBot 2
The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
- Approved--Ymblanter (talk) 16:01, 29 July 2015 (UTC)[reply]
ProteinBoxBot
ProteinBoxBot (talk • contribs • new items • new lexemes • SUL • Block log • User rights log • User rights • xtools)
Operator:Andra Waagmeester (talk)
Task/s: Add content on genes and proteins to WikiData and subsequently enrich these items with relevant additions from authoritative resources.
Function details:
This is a (re)approval request for the ProteinBoxBot (original request), as stipulated on the Wikidata:Administrators noticeboard. The ProteinBoxBot is central to our efforts to enrich Wikidata with genes, proteins, diseases, drugs and the relationships between them. We currently have code in place to enrich Wikidata with content from Entrez Gene and the Disease Ontology. With the previous bot credentials this code has added human and mouse genes as well as the diseases in the Disease Ontology. The items have also been updated regularly. We now understand that some of these tasks were performed outside the scope of the initial bot approval. We would like to request reapproval of the ProteinBoxBot so that we can continue enriching Wikidata with genes and proteins. In addition to the initial scope of genes and proteins, we would also like to request approval for adding diseases. Our work has also recently been presented on two occasions 1 2
The bot functions and tasks are developed in sprints.
We are currently developing the bot code to add drugs and the links between drugs, genes and diseases. We will request approval for these tasks in due time. Andrawaag (talk) 19:38, 2 June 2015 (UTC)[reply]
- I will approve the bot in a couple of days provided there have been no objections.--Ymblanter (talk) 18:02, 5 June 2015 (UTC)[reply]
- @Jura1:@Multichill: Any comments on this request for permissions? Is this in line with what was suggested on the Administrators' noticeboard? Andrawaag (talk) 20:54, 5 June 2015 (UTC)[reply]
- How can you ensure that the bot is not creating duplicates? In the last sprints, the following items about the same disease were created: Q18554270, Q19587455, Q19609917, Q19833035. Have you developed a method to prevent such duplicates?
- Besides adding identifiers of external databases, do you plan to add other information?
- Is it possible to have a look at the source code of your bot? --Pasleim (talk) 22:04, 6 June 2015 (UTC)[reply]
- @Pasleim:
- 1. Until now the bot has been relying on wdq to prevent creating duplicates. I am aware of the delay between content appearing on Wikidata and in wdq. As you rightly note, some duplicates slip through. This is typically the case when there is an anomaly in the connection between the bot and the API. I usually deal with these duplicates, but I apparently missed some. There is typically a week between bot runs to allow wdq to update. I am considering adding local logging to deal with this issue. However, I don't want to rely on a local log file only, since that would disregard additions to Wikidata made outside the scope of our bot. A hybrid of wdq and a local log file of our bot's additions is something I am going to implement very soon.
- 2. In addition to identifiers of external databases, we also add titles and aliases, as well as some provenance on when and from where the statements originate. The properties are listed on the page describing a specific bot task: e.g. https://www.wikidata.org/wiki/User:ProteinBoxBot/Disease_items#Properties
- 3. The source code of our bot is maintained on Bitbucket and can be found at https://bitbucket.org/sulab/wikidatabots/src. We are currently in the process of restructuring this to separate accepted bot tasks from those for which permission still needs to be requested. Andrawaag (talk) 20:41, 8 June 2015 (UTC)[reply]
- @Pasleim:
- Regarding point 1: We will introduce further measures to avoid any duplicate entries. This also means that we might abandon the use of wdq in the bot code, so the fact that wdq only uses data dumps from production Wikidata will no longer be an issue. Furthermore, the bot will generate and maintain a local list that maps IDs and/or accession numbers from authoritative resources to Wikidata item IDs. I think the combination of these two approaches will prevent the creation of duplicate entries. Sebotic (talk) 20:40, 8 June 2015 (UTC)[reply]
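The hybrid approach described in the two replies above (a local list of external IDs mapped to item IDs, backed by a lookup against Wikidata itself) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the bot's actual code: `ItemRegistry`, `remote_lookup` and `create` are hypothetical names, and the remote lookup is passed in as a plain function so that the sketch does not depend on any particular query service.

```python
from typing import Callable, Dict, Optional, Tuple


class ItemRegistry:
    """Hypothetical helper: maps external accession numbers to Wikidata item IDs."""

    def __init__(self, remote_lookup: Callable[[str], Optional[str]]):
        # Local list of IDs this bot has already seen or created.
        self._local: Dict[str, str] = {}
        # Assumed callable that asks Wikidata (or a mirror) for an existing item.
        self._remote_lookup = remote_lookup

    def find(self, external_id: str) -> Optional[str]:
        """Check the local list first, then fall back to the remote lookup."""
        qid = self._local.get(external_id)
        if qid is None:
            qid = self._remote_lookup(external_id)
            if qid is not None:
                # Remember items added by others, so they are not duplicated later.
                self._local[external_id] = qid
        return qid

    def get_or_create(
        self, external_id: str, create: Callable[[str], str]
    ) -> Tuple[str, bool]:
        """Return (item ID, created?) and only create when no item is known."""
        existing = self.find(external_id)
        if existing is not None:
            return existing, False
        qid = create(external_id)
        # Record the new item immediately, before any remote index catches up.
        self._local[external_id] = qid
        return qid, True
```

Recording a new item in the local list right after creation is what covers the window in which a lagging remote index does not yet know about it, while the remote lookup covers items added outside the bot.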
Ok, I support the bot tasks Gene and protein items, Disease items and Drug items. For the other tasks, please either provide more information in the next few days, or start a new approval request as soon as a more detailed plan is available. --Pasleim (talk) 19:32, 12 June 2015 (UTC)[reply]
- Oppose please do a test run of about 100 edits first. The bot's most recent edits are still those of the previous accident.--- Jura 20:05, 12 June 2015 (UTC)[reply]
- We are refurbishing our bot code, so we will indeed run 100 test edits. I assume 100 test edits are okay? We completely stopped running the bot upon the request not to do anything until we got a reapproval. Andrawaag (talk) 09:27, 26 June 2015 (UTC)[reply]
@Jura1:@Pasleim:@Ymblanter: We have taken the time to fully reimplement our bot code. The current bot contains two layers. The first is a core layer that takes responsibility for the communication with the API of Wikidata. The second is a task-specific layer, which takes care of extracting knowledge from the relevant authoritative source (e.g. Entrez Gene in this specific task approval case). Each approved task will share the core layer.
I performed a little over 100 test edits with the new bot, and we are looking forward to doing a full run to update Wikidata with gene information. Andrawaag (talk) 09:45, 13 July 2015 (UTC)[reply]
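The two-layer structure described above might look something like the sketch below. All class and method names are hypothetical and are not taken from the bot's repository; the core layer here only records edits in memory instead of calling the Wikidata API, and the record fields are invented for illustration.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, Iterator, List, Tuple


class WikidataCore:
    """Core layer (hypothetical): the only place that would talk to the API.

    In this sketch it just collects the edits in a list."""

    def __init__(self) -> None:
        self.submitted: List[dict] = []

    def write_item(self, label: str, claims: Dict[str, str]) -> None:
        self.submitted.append({"label": label, "claims": dict(claims)})


class BotTask(ABC):
    """Task layer base class: extracts data from one source, writes via the core."""

    def __init__(self, core: WikidataCore) -> None:
        self.core = core

    @abstractmethod
    def extract(self) -> Iterator[Tuple[str, Dict[str, str]]]:
        """Yield (label, claims) pairs from the authoritative source."""

    def run(self) -> int:
        count = 0
        for label, claims in self.extract():
            self.core.write_item(label, claims)
            count += 1
        return count


class EntrezGeneTask(BotTask):
    """Task-specific layer for gene records (record layout is an assumption)."""

    def __init__(self, core: WikidataCore, records: Iterable[dict]) -> None:
        super().__init__(core)
        self.records = records

    def extract(self) -> Iterator[Tuple[str, Dict[str, str]]]:
        for record in self.records:
            # P351 is the Entrez Gene ID property mentioned in this discussion.
            yield record["symbol"], {"P351": record["gene_id"]}
```

With this split, a separately approved task (for example for diseases) would add another `BotTask` subclass and reuse the same core object, so fixes to the API handling apply to every task at once.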
@Jura1: does the pbb bot have your support following @andra 's last run? --I9606 (talk) 22:41, 17 July 2015 (UTC)[reply]
Sorry to jump in here at the last minute, but could your bot include a URL in the reference for the source of the info it adds (the particular web page, not just the web site), even if it is the same as the Entrez Gene ID (P351) statement? Joe Filceolaire (talk) 08:18, 23 July 2015 (UTC)[reply]
- To be clear, you are suggesting adding Property:P854 (reference URL) in addition to the 'stated in' and 'imported from' reference properties that are already being used, correct? This seems doable - at least for the entrez gene ids - but I am curious to know if there is a specific use case you have in mind here? --I9606 (talk) 18:17, 23 July 2015 (UTC)[reply]
- @Filceolaire: as I9606 already said, we can easily add an additional reference. Personally I would be a bit cautious here. URLs do change over time, and if such a change happens on Entrez Gene, it is easier to fix it on the Entrez Gene ID property, where it takes effect for all Entrez Gene entries, without the need to rerun a bot cycle to change the reference URLs for all genes. Having said that, adding a reference URL to the list of references is really easy, so if your use case requires this, we can easily add it. Andrawaag (talk) 19:04, 23 July 2015 (UTC)[reply]
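To make the suggestion concrete, a reference that carries a reference URL (P854) next to 'stated in' could be assembled roughly as below. This is a sketch under assumptions: the function names are hypothetical, the URL pattern for Entrez Gene pages is assumed rather than taken from this discussion, P248 is used for 'stated in', and the item ID of the source is passed in as a parameter rather than hard-coded. The 'imported from' snak is left out for brevity.

```python
# Assumed URL pattern for an Entrez Gene page; not confirmed in this discussion.
ENTREZ_GENE_URL = "https://www.ncbi.nlm.nih.gov/gene/{gene_id}"


def item_snak(prop: str, qid: str) -> dict:
    """Snak whose value is a Wikidata item, in the Wikibase JSON layout."""
    return {
        "snaktype": "value",
        "property": prop,
        "datavalue": {
            "type": "wikibase-entityid",
            "value": {"entity-type": "item", "numeric-id": int(qid[1:])},
        },
    }


def url_snak(prop: str, url: str) -> dict:
    """Snak whose value is a URL (serialised as a string datavalue)."""
    return {
        "snaktype": "value",
        "property": prop,
        "datavalue": {"type": "string", "value": url},
    }


def build_reference(gene_id: str, stated_in_qid: str, include_url: bool = True) -> dict:
    """Build one reference block: 'stated in' plus an optional reference URL."""
    snaks = {"P248": [item_snak("P248", stated_in_qid)]}
    if include_url:
        url = ENTREZ_GENE_URL.format(gene_id=gene_id)
        snaks["P854"] = [url_snak("P854", url)]
    return {"snaks": snaks, "snaks-order": list(snaks)}
```

Keeping the URL pattern in a single constant reflects the caution voiced above: if the source site changes its URLs, only one line needs updating, although references already written to Wikidata would still need another bot run.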
@Multichill: Does your prohibition on editing Wikidata items with this bot still stand? We have dealt with the issues and would really like to restart our efforts. Andrawaag (talk) 16:30, 27 July 2015 (UTC)[reply]
support The bot has been fixed and substantially improved. There are active developers waiting for this approval and this is slowing their work down. Please unblock the bot and iterate with the bot developers about minor changes such as the suggested addition of the URL above. (Note that the information that would be in that URL is already present in the data being added by the bot.) --I9606 (talk) 16:35, 27 July 2015 (UTC)[reply]
- The bot is not currently blocked. I am not happy with the fact that there are no reactions, but I am hesitant to approve the task given that a lot of opposition has been voiced, and after the modifications nobody changed their opinion. If there is support for this task I will be happy to approve the bot.--Ymblanter (talk) 19:14, 27 July 2015 (UTC)[reply]
- @Ymblanter: I think that "a lot of opposition" is an overstatement. There is only one 'oppose' listed above from @Jura1: who has not responded to repeated pings here and on his talk page. @Andrawaag: has patiently responded to all requests here including a substantial restructuring of the code to reduce the chances of similar errors showing up again. This is a good bot, adding content of high value from a respectful and dedicated developer. It is very frustrating to see progress here hampered not by logical argument by people that care, but by a lack of response from people entrusted with authority. --I9606 (talk) 20:42, 27 July 2015 (UTC)[reply]
support I am also part of the team developing this bot, but just to underscore the point that there are many people who care about Wikidata thinking about this bot and this task, not a single rogue developer. Andrew Su (talk) 20:53, 27 July 2015 (UTC)[reply]
support sorry for not responding to the pings earlier. Have been a bit busy and traveling. As far as I can see all issues have been addressed. I'm in favor of starting the bot and keeping an eye on feedback. One thing to note is that while looking at your source code I noticed you seem to be reinventing the wheel. I would recommend that at some point you switch to Pywikibot as the underlying library. It contains all the low and high level stuff you need to be able to communicate with Wikibase/Wikidata. Don't let that block you for now, but you should look into that for the future. We have several people around here who would probably be more than willing to help you convert your code. Thanks for your work. Multichill (talk) 16:36, 28 July 2015 (UTC)[reply]