
User talk:ProteinBoxBot




Can we merge Q22274764 and fermentation (Q41760) ?

Thank you. Tubezlob (🙋) 16:40, 6 July 2017 (UTC)

Done, thanks! Gstupp (talk) 17:20, 6 July 2017 (UTC)


Looking for descriptions starting with "a ..", I came across a few generated by this bot (sample correction). For future runs, maybe the initial "a", underscores in text and final dots can be removed directly.
--- Jura 14:16, 7 September 2017 (UTC)

Hi Jura, I've made an issue on our github to address this. Gstupp (talk) 23:30, 7 September 2017 (UTC)
Is there an official style guide for descriptions? The suggestions seem reasonable, but unless they were part of an official style guide I think keeping the status quo could be convincingly argued as well (to keep in sync with the source databases). Thoughts? Andrew Su (talk) 00:23, 8 September 2017 (UTC)
We have one: It's Help:Descriptions. --Izno (talk) 20:02, 8 September 2017 (UTC)
Thanks for the link, very helpful... Best, Andrew Su (talk) 20:28, 8 September 2017 (UTC)
  • Maybe I need to ping the operators as well @Andrawaag, Sebotic:.
    --- Jura 07:26, 8 September 2017 (UTC)
Both Andrew and Gstupp are also part of the team maintaining this bot. I concur with Andrew that maintaining the descriptions as they are in the original source makes sense. --Andrawaag (talk) 07:45, 8 September 2017 (UTC)
It would be good to have an operator for this bot who is knowledgeable about Wikidata rules. We already had to block it in the past, and you should make sure that this does not happen again.
--- Jura 08:03, 8 September 2017 (UTC)
I've updated the bot to take care of this. Gstupp (talk) 23:08, 2 October 2017 (UTC)
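
For illustration, the cleanup suggested in this thread (initial "a", underscores, final dots) could look roughly like the sketch below. This is not the bot's actual code; the function name `clean_description` is hypothetical, and stripping "an" and "the" as well as "a" is an assumption that goes slightly beyond what was asked.

```python
import re


def clean_description(text: str) -> str:
    """Tidy a source description before writing it to an item (hypothetical helper)."""
    # Underscores in source text become spaces.
    cleaned = text.replace("_", " ")
    # Collapse runs of whitespace and trim the ends.
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    # Drop a leading article; "an" and "the" are an assumed extension of the request.
    cleaned = re.sub(r"^(?:an?|the)\s+", "", cleaned, flags=re.IGNORECASE)
    # Drop trailing dots and any spaces left around them.
    return cleaned.rstrip(". ")
```

A word that merely starts with "a", such as "amino", is left alone because the pattern requires whitespace after the article.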

ICD-10-CM and ICD-10 codes are not necessarily interchangeable.

I've reverted your change of the ICD-10 code over at REM sleep behavior disorder (Q2103933). The code G47.52 is not found in ICD-10; this can be double-checked by searching for it in the online version of ICD-10.[1] It is within ICD-10-CM, which now has its own property: ICD-10-CM (P4229). Little pob (talk) 20:17, 23 September 2017 (UTC)

I've edited the bot to use ICD-10-CM and switched over codes to the appropriate system (as determined by the Disease Ontology). Gstupp (talk) 23:29, 25 September 2017 (UTC)
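
As a rough sketch of the kind of routing described here, a cross-reference can be sent to a property according to its coding-system prefix. This is not the bot's code: the prefix strings (`ICD10CM`, `ICD10`) are assumptions about how the source labels its cross-references, P494 for ICD-10 is an assumed property ID, and only P4229 is named in this thread.

```python
# Hypothetical prefix-to-property table.
XREF_PROPERTY = {
    "ICD10CM": "P4229",  # ICD-10-CM, the property named in this thread
    "ICD10": "P494",     # ICD-10 (assumed property ID)
}


def property_for_xref(xref: str):
    """Return (property, code) for a 'PREFIX:code' xref, or None if it is not routed."""
    prefix, sep, code = xref.partition(":")
    if not sep or not code.strip():
        return None  # not in PREFIX:code form
    prop = XREF_PROPERTY.get(prefix.strip().upper())
    return (prop, code.strip()) if prop else None
```

Keeping the two systems as separate keys means a code such as G47.52 never ends up on the ICD-10 property just because the strings look alike.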

Cerebrovascular disease and stroke

Hi! The en label of stroke (Q12202) should be "stroke", according to enwiki, and DOID:6713 (cerebrovascular disease) should be linked to cerebrovascular disease (Q3010352), instead of stroke (Q12202). Thanks. --Okkn (talk) 14:28, 22 October 2017 (UTC)

Hi Okkn. I think I've cleaned up the items and made an issue on DO. Gstupp (talk) 18:11, 2 November 2017 (UTC)

Thank you, Gstupp! --Okkn (talk) 10:18, 3 November 2017 (UTC)

Help with some ontology issues in diseases?

I've been cleaning up some ontology issues in Wikidata, but I'm struggling to figure out what to do with diseases. One set of problems is cycles - x subclass y subclass x. The list of remaining cases is here and you'll see that all the ones left there are in the area of diseases. Looking at them, the relationships seem to be based on the Disease Ontology in ways that don't make sense. For example, aseptic meningitis (Q4804182) is stated as a subclass of viral meningitis (Q3301664) based on the DOIDs, but logically the subclass relationship should be the other way round, if those definitions are correct (viruses are a subset of non-bacterial causes). Is this a problem in the Disease Ontology, or with the relationships that have been entered here? We really need an expert or two to help out on this; if you aren't able to do it, I'd appreciate it if you could point to somebody who can help. Thanks! ArthurPSmith (talk) 15:24, 2 November 2017 (UTC)

Thanks ArthurPSmith, I will look into this. Gstupp (talk) 18:12, 5 November 2017 (UTC)
I've gone through and fixed some of the ones that were added by users (not from Disease Ontology) that are wrong. There are others, like the virus one, that are possibly issues with DO, but are confusing to me. I've shared this with the DO team and will update. Gstupp (talk) 00:01, 8 November 2017 (UTC)
Thanks! There's also one three-level cycle in diseases maybe you could check out too? Wikidata:WikiProject Ontology/Problems/3rd-order subclass of self - I was trying to sort it out myself but got very confused between the different language wikipedias what was going on here regarding macular degeneration. ArthurPSmith (talk) 16:30, 9 November 2017 (UTC)
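
The two- and three-level cycles discussed in this thread can be found mechanically once the subclass statements are loaded into memory. The following is a minimal sketch, not the method behind the WikiProject Ontology reports; the function name and the input shape (a dict from item ID to the set of its direct superclasses) are assumptions made for illustration.

```python
def find_short_cycles(subclass_of, max_len=3):
    """Return subclass cycles of length 1..max_len as sorted tuples of item IDs.

    Each cycle is reported once, rotated so that it starts at its smallest ID.
    """
    found = set()

    def walk(start, node, path):
        for parent in subclass_of.get(node, ()):
            if parent == start:
                # Closed a loop back to the starting item: normalise and record it.
                pivot = path.index(min(path))
                found.add(tuple(path[pivot:] + path[:pivot]))
            elif parent not in path and len(path) < max_len:
                walk(start, parent, path + [parent])

    for item in subclass_of:
        walk(item, item, [item])
    return sorted(found)


# Toy data with placeholder IDs: A and B form a two-level cycle.
print(find_short_cycles({"A": {"B"}, "B": {"A"}, "C": {"A"}}))
```

A search like this only reports that a cycle exists. Deciding which of the two statements is wrong still needs someone who knows the subject, which is the question raised above.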

“remove deprecated statements”

You changed some items with the edit summary “remove deprecated statements” (example). What does that mean, and what shall we do with the remaining, almost empty items? —MisterSynergy (talk) 08:17, 3 November 2017 (UTC)

Hello MisterSynergy, Thanks for pointing this out. This is the result of inconsistencies in the external IDs between Robinow Syndrome and its subclasses. The DO bot failed to update the statements on this item because the external IDs conflicted with the external IDs in another item. You can see the error in the log file row 1337.
I've added back the DO ID on this, so when the bot runs again it will re-add the current statements. I've also created an issue in DO to fix the IDs.
There are 34 other items in the log that probably have this issue too, so I'll take a look at them as well.
Gstupp (talk) 00:14, 8 November 2017 (UTC)

Ok, I've fixed up the others. Found using this sparql query: link Gstupp (talk) 23:29, 8 November 2017 (UTC)

Thanks, looks good indeed! MisterSynergy (talk) 08:54, 9 November 2017 (UTC)

has listed ingredient (P4543) and drugs

We imported a lot of data about drugs from OpenFDA. As far as I remember, the reason why we only store has active ingredient (P3781) was that at the time we had no good property for the other ingredients. ChristianKl () 19:46, 22 November 2017 (UTC)

Hi ChristianKl, Most of the active ingredient information actually came from the EMA. (Example) The inactive ingredients are sadly not available in a structured form (that I can find). Same story with OpenFDA. It looks like many drug labels have brand name and active ingredients (along with UNII, so we don't have to string match!!), but no other ingredients. They only exist in the free text package labeling and would be a lot of work to pull out and normalize. In addition, the indications are not structured in any way, which is why we grabbed them from EMA. See the openFDA field in this. Gstupp (talk) 20:02, 22 November 2017 (UTC)

This bot created a subclass loop!

Preprotein translocase subunit SecE (Q24738466) was just made a subclass of Protein translocase SEC61 complex, gamma subunit (Q24768152), which was just made a subclass of Preprotein translocase subunit SecE (Q24738466)! Something's gone wrong there.

Also, any progress on resolving the remaining subclass loops in diseases? See Wikidata:WikiProject Ontology/Problems/subclass of subclass of self - thanks! ArthurPSmith (talk) 15:02, 28 November 2017 (UTC)

Hello Arthur. You're too fast! The bot was in the middle of a bot run and hadn't finished updating all items when you posted. The run has completed and I don't see any loops.

I submitted two issues for the remaining subclass loops 1 2 Gstupp (talk) 20:27, 28 November 2017 (UTC)

@Gstupp: yes, it looks better, thanks, and thanks for posting those issues! ArthurPSmith (talk) 21:33, 28 November 2017 (UTC)

cell (Q7868) subclass of (P279) cellular component (Q5058355)?

Cell is a part of a cell? --Fractaler (talk) 07:22, 20 December 2017 (UTC)

Yes. According to the reference: Gene Ontology, cell is a cellular_component. Subclass does not mean "part of". Gstupp (talk) 18:22, 20 December 2017 (UTC)

Now cellular component (Q5058355) has the description "part of a cell". Right? Fractaler (talk) 06:07, 21 December 2017 (UTC)

Do not remove statements automatically created by other users

Hi. I noticed that ProteinBoxBot removes instance of (P31), subclass of (P279) and has part (P527) statements created by other users. I think they should not be removed automatically even if they are not defined in Gene Ontology. --Okkn (talk) 05:15, 2 February 2018 (UTC)

@Okkn: Thanks for bringing this to our attention -- those are definitely unintentional changes. The primary person to look at this is out of the office at the moment, but we'll get this addressed early next week. Apologies, and thanks again for the bug report! Best, Andrew Su (talk) 17:09, 2 February 2018 (UTC)
Hi @Okkn:, I've implemented the changes here. I'll look into seeing if I can revert what was overwritten. Thanks for pointing this out. Gstupp (talk) 20:23, 6 February 2018 (UTC)

To prevent items from being P31 and P279* of the same class

I think using instance of (P31) to indicate the semantic type of the item is useful to query. On the other hand, ProteinBoxBot creates lots of instance of (P31) statements which results in having instance of (P31) and subclass of (P279)* of the same class (such as disease (Q12136), gene (Q7187), biological process (Q2996394), cellular component (Q5058355) and molecular function (Q14860489)), and some ontologists regard this as a problem.

To resolve this problem, I propose that we create new metaclasses (first-order metaclass (Q24017414)), like type of sport (Q31629), cell type (Q189118) or type of mathematical function (Q47279819), corresponding to the above classes, and that we replace disease (Q12136), gene (Q7187) etc. in instance of (P31) statements with the metaclasses ("disease class" or "type of gene"). --Okkn (talk) 06:08, 4 February 2018 (UTC)

Subclass of self is just wrong!

ProteinBoxBot has been making a number of edits like this one that assert that something is a subclass of itself. This is meaningless. I will remove them, but please ensure they don't recur. ArthurPSmith (talk) 21:25, 27 February 2018 (UTC)

Same thing for instance of (P31) (I removed 3 just now, on cellular component (Q5058355), molecular function (Q14860489) and biological process (Q2996394)). Cdlt, VIGNERON (talk) 09:09, 11 May 2018 (UTC)
Thanks for pointing these out Gstupp (talk) 19:39, 11 May 2018 (UTC)

Redundant aliases

In edits like this one, the bot seems to be adding "Name (disorder)" as an alias (complete with unwanted capitalization). Even if that's in a source, it's probably not correct to be adding the source's disambiguator to the Wikidata record.

Also, for edits such as this one, is there a way to tell it to stop adding aliases after they've been corrected? Abbreviations such as "acute/subac." should be spelled out, and the abbreviated version shouldn't be used at all.

(Please ping me if you have any questions.) WhatamIdoing (talk) 22:30, 5 March 2018 (UTC)

Hi WhatamIdoing. Yes, these "Name (disorder)" aliases are present in the source (DO). I agree it's not useful to have this. I'll edit the bot to filter those out. Are there any others you noticed? I see "morphologic abnormality" and "finding" as well.

For the second issue, this is a lot harder to address. I have no way of knowing that the abbreviated version was corrected without getting the full history of every item... I can filter "acute/subac." out as well. But I only see one item containing "subac.". Have you seen other abbreviations? Gstupp (talk) 00:33, 6 March 2018 (UTC)

I think that the ideal behavior is to make sure that "Name" (or "name") is present, and if not, to add the name without the (disorder) appended. (I don't know if that's easy to code, though.)
I think that "NOS" is the most common abbreviation, and it's probably just as irrelevant as (disorder), but as it theoretically contains some content ("not otherwise specified"), I have slightly more sympathy for it. WhatamIdoing (talk) 06:26, 6 March 2018 (UTC)
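
The behavior described in this thread (strip the source's disambiguator, then add the bare name only if the item does not already have it) is fairly easy to sketch. The code below is an illustration under assumptions, not the bot's implementation: the function name is hypothetical, and the suffix list contains only the three disambiguators mentioned above.

```python
import re

# Disambiguators mentioned in this thread; the list is not exhaustive.
SUFFIXES = ("disorder", "finding", "morphologic abnormality")
_SUFFIX_RE = re.compile(
    r"\s*\((?:%s)\)\s*$" % "|".join(re.escape(s) for s in SUFFIXES),
    re.IGNORECASE,
)


def clean_aliases(source_aliases, existing_names):
    """Strip trailing disambiguators and skip names the item already carries."""
    seen = {name.casefold() for name in existing_names}
    result = []
    for alias in source_aliases:
        bare = _SUFFIX_RE.sub("", alias).strip()
        # Compare case-insensitively so "Name" and "name" count as the same.
        if bare and bare.casefold() not in seen:
            seen.add(bare.casefold())
            result.append(bare)
    return result
```

This does not solve the second issue raised above: without the item history, a filter like this cannot tell that an editor has already corrected an alias.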

More subclass loops from imported ontologies

We now have regulation of leucine import across plasma membrane (Q27303095) subclass of regulation of leucine import (Q22303228), which is in turn a subclass of regulation of leucine import across plasma membrane (Q27303095) thanks to your recent edits - in this case both relations are referenced to "Gene Ontology", the first dated 9 May 2017, and the second dated 6 March 2018. The new relation doesn't make sense to me given the labels - maybe something's gone wrong with the ID's? And we now also have lactic acidosis (Q1500373) subclass of metabolic acidosis (Q1598200), which is in turn a subclass of lactic acidosis (Q1500373), this time both sourced to "Disease Ontology" releases, the first from 5 March 2018 and the second from 5 December 2017. Has this source reversed this relation in the last few months? In any case, assuming the source does not contain both statements now, the one that is no longer "valid" should at least be deprecated (I think I would prefer it to be removed altogether, but maybe recording the old version of the relation is useful for some purpose). ArthurPSmith (talk) 14:42, 6 March 2018 (UTC)

@ArthurPSmith: Metabolic acidosis was fixed about a week ago. --Okkn (talk) 05:26, 7 March 2018 (UTC)
The disease ones should be fixed. For the GO ones, I made an issue and am working on this. Gstupp (talk) 19:44, 14 March 2018 (UTC)
That was me I think! I've been working on MeSH items and accidentally merged coloring agents and food coloring. Thanks for fixing it. Gstupp (talk) 19:44, 14 March 2018 (UTC)

aromatase: Wikipedia links moved from the enzyme to the gene


It seems ProteinBoxBot really wants the Wikipedia articles to be linked by the gene element rather than the enzyme:

I've added them back (and the commons category) to the enzyme, but won't ProteinBoxBot move them again? 06:10, 21 March 2018 (UTC)

User:The_RedBurn, This article (and indeed most articles about human genes/proteins) is about both the gene and the protein, and it is more consistent to have all of these articles linked to the gene Wikidata item. Additionally, the infobox on the Wikipedia page requires that the page be linked to the gene item and not the protein. As of right now, the infobox is displaying NA. Gstupp (talk) 18:16, 21 March 2018 (UTC)
It seems rather strange to link the articles (which all seem to mainly talk about the enzymes (and then about their coding genes)) to the genes instead of the enzymes. The drawback of doing so is that the Wikipedia mobile apps and VisualEditor describe those enzymes as genes, which may confuse the user. Is there any other reason than "most/all of these articles are linked to the gene wikidata item"? About the infobox, that's just a technical detail, I've fixed it for now. The RedBurn (ϕ) 19:05, 21 March 2018 (UTC)

Trying to clean up NCI Thesaurus cross-references before Cellosaurus wikipedia bot starts working

I am trying to add many new disease terms, mainly in 3 categories: 1) animal disease terms (which are not problematic for ProteinBoxBot as you do not have them in DO), 2) cancer terms (mostly children of existing ones, so here again there are no problems), 3) genetic disease terms (and here there are many problems).

The problems are that the mappings of Disease Ontology to NCI Thesaurus are: 1) incomplete, 2) partially wrong, 3) somehow do not go to the right specific Wikidata entry. Example: for neurofibromatosis (Q847605), the bot wants to add C3273, which I have added to the correct Wikidata entry, neurofibromatosis type I (Q7616509); similarly, C3274 was added to neurofibromatosis type II (Q1935832).

How do we go forward to correct all these errors and inconsistencies? --Amb sib (talk) 20:07, 20 May 2018 (UTC)

Removing valid MeSH ID as deprecated statements

Hi, ProteinBoxBot has removed valid MeSH IDs and other statements:

You should undo them. Regards, --Okkn (talk) 07:45, 26 June 2018 (UTC)

It looks like these were removed in DO and then added back in for some reason. There was a new release this morning with them added back, so the bot should re-add them... Gstupp (talk) 20:01, 26 June 2018 (UTC)

The bot also removed ICD-9-CM (P1692), for example this edit. ICD-9 is old, but despite the recent publication of ICD-11, it is still in common use and is not deprecated. Please roll back these erroneous removals. --RexxS (talk) 22:28, 26 June 2018 (UTC)

Ya, that's a good question.. Thanks for bringing this up. They were removed by DO and so were removed by the DO bot. I made an issue on their issue tracker: Gstupp (talk) 23:07, 26 June 2018 (UTC)
Do you realise that the bot is not only removing valid information from Wikidata but also from all the Wikipedias who derive their ICD9 information from Wikidata? Wikidata is not Disease Ontology and we should not be hostage to their miscalculations. Bot operators are responsible for the edits made by their bots and must take responsibility for correcting their errors. Is it necessary to ask for administrator assistance to have these mistakes rolled back? --RexxS (talk) 19:18, 27 June 2018 (UTC)
The goal of the DO bot is to accurately reflect what DO says. It is not a disease bot, it's a DO bot. If there are non-DO references for a statement (or no references at all), the bot will leave it alone. The bot only removed statements whose only reference was DO. Course of action: 1) I don't know why they were removed from DO, but they should be added back. 2) We are adding disease info from another source (MONDO), and so will have two sources of information. 3) In a broader sense, I think we should strive to have multiple independent sources of information (with references for each). In this way, the trustworthiness of a statement can be better assessed, and as a byproduct, the impact of sources going rogue is better mitigated. Gstupp (talk) 20:47, 27 June 2018 (UTC)
When it comes to these identifiers the reference is in the code itself. The code is the reference to the ICD handbook, so DO is entirely irrelevant here. This is a major issue and the idea of using DO as a reference is faulty. DO is not a reference, it is a directory of related links and queries. DO may be used as an additional link, but these codes are not unreferenced. I think that applies to all DO-referenced codes, so you'll have to stop doing that entirely. CFCF (talk) 14:16, 28 June 2018 (UTC)
On the contrary, DO is relevant here. The source for the claim was DO in the first place. As far as I know, there hasn't been any effort to get the ICD handbook into Wikidata. So when you are looking at these disease statements, you see ICD codes if they exist as mappings in the Disease Ontology. If the Disease Ontology no longer substantiates those claims, they need to be removed, since the reference is no longer accurate. The solution here would be for the ICD handbook in its entirety to be added to Wikidata, or for other resources to add mappings to ICD codes. --Andrawaag (talk) 15:15, 28 June 2018 (UTC)
I agree. I think if there were statements that had a second reference (to the ICD handbook or some other source), then the bot should just delete the DO reference. In these cases where DO was the only reference stated, when/if the latest version of DO ceases to make those statements then they should be removed from Wikidata. So I think the bot is working correctly here (and that I hope the DO team restores those links ASAP, per the github issue). Best, Andrew Su (talk) 15:59, 28 June 2018 (UTC)
In gout (Q133087), for example, this bot removed ICD-9 codes because the statements had only DO references. However, gout (Q133087) already had ICD-9 codes before this bot began to update this item. Is that really correct? --Okkn (talk) 17:49, 28 June 2018 (UTC)
The algorithm you are using to decide to remove a statement is faulty. At present you decide unilaterally that the link to DO is the only reference for an ICD-9. That is false because the ICD-9 is a reference for itself, as is the case for many identifiers - in other words, anyone can verify that the ICD-9 code is accurate for the given entity simply by following the link constructed by the property. In other words, it doesn't matter whether the ICD-9 code exists in DO or not, it is still verifiable without any reference to DO. Now, please stop removing accurate information from the database simply because you have faulty code in your bot. --RexxS (talk) 21:40, 30 June 2018 (UTC)

To follow back up on this. At this time, the xrefs are back in DO and are back in Wikidata. Furthermore, I've started adding in Mondo data, and so there is a second source of information, and on this item in particular, there are ICD9 codes from both Mondo and DO.

Discussing what happened in two steps: 1) Initially (before ProteinBoxBot), the item had ICD9 statements that had been imported from English Wikipedia. The bot replaced them with the DO reference. At the time, this felt like the most reasonable thing to do, as we were not sure if the data had been reviewed/curated by anyone, or where the data came from (before Wikipedia). In retrospect, it would have been better to leave the "imported from" English Wikipedia references. 2) After DO removed the ICD9 xrefs, the bot removed those statements from Wikidata, as it should have. An identifier is not a reference for itself. An entity can have multiple different sources for multiple conflicting sets of xrefs. While it is true that an individual can click on the link to verify whether the ID is correct, this is not the same as a primary resource, or external organization, stating that this xref is correct (and also specifying the external ID to which the xref is a cross-reference).

If Wikidata wants to be a self-contained, primary source of IDs like these, then, I think, the correct course of action would be to import the cross-references from the external resource itself, and add a reference back to that original resource. For instance, in the case of ICD9, the ICD9 entities could be matched up to the correct disease items in Wikidata and a reference added (which would include the date retrieved, or version number, etc). Alternatively, or in addition, this could be jump-started with the existing mappings. Gstupp (talk) 21:51, 3 July 2018 (UTC)
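
To make the rule argued over in this thread concrete, here is a minimal sketch of the decision as the operators describe it: leave a statement alone if the source still asserts it or if the source is not among its references, remove only the source's reference if other references remain, and remove the whole statement only when the source was its sole reference. This is an illustration of the stated policy, not the bot's code; all names are hypothetical.

```python
from enum import Enum


class Action(Enum):
    KEEP = "keep"
    DROP_SOURCE_REFERENCE = "drop source reference"
    REMOVE_STATEMENT = "remove statement"


def decide(reference_sources, source, still_in_source):
    """Decide what to do with one statement.

    reference_sources: one source name per reference on the statement.
    source: the source the bot mirrors (for example "DO").
    still_in_source: whether the current release still asserts the statement.
    """
    if still_in_source or source not in reference_sources:
        # Also covers statements with no references at all.
        return Action.KEEP
    if all(ref == source for ref in reference_sources):
        return Action.REMOVE_STATEMENT
    return Action.DROP_SOURCE_REFERENCE
```

Under this rule, the outcome for gout (Q133087) depends on whether the earlier "imported from" reference was kept. Had it been left in place, the statement would have fallen into the `DROP_SOURCE_REFERENCE` branch rather than being removed.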

Subclass of self? (I think due to two different DOIDs)

This edit made myocardial infarction (Q12152) a subclass of itself. This seems wrong. I think the problem is caused by having two distinct Disease Ontology ID (P699) values on this item: DOID:5844 and DOID:9408. The latter is labeled "acute myocardial infarction" at the DOID website, so I think this item needs to be split into two, but I'm not a medical expert... ArthurPSmith (talk) 12:53, 26 June 2018 (UTC)

  Done @ArthurPSmith: I have undone the invalid merge, and now myocardial infarction (Q12152) and acute myocardial infarction (Q18558122) are separated. --Okkn (talk) 19:15, 26 June 2018 (UTC)
thanks! Gstupp (talk) 20:00, 26 June 2018 (UTC)

location (P276) on disease items should be replaced with anatomical location (P927)

ProteinBoxBot is importing “located in” relations from Disease Ontology by using location (P276), and it causes value type violations (ex. pancreatic cancer (Q212961), skin disease (Q949302)). For the anatomical structures, anatomical location (P927) should be used instead of location (P276). --Okkn (talk) 19:39, 26 June 2018 (UTC)

Ah, that is definitely better. Will change. Thanks Gstupp (talk) 20:00, 26 June 2018 (UTC)   Done Gstupp (talk) 21:51, 3 July 2018 (UTC)

Alcoholic disorders

I think that this probably needs a manual review:

I'm not even sure if those Wikipedia articles are all on the same subject. WhatamIdoing (talk) 19:07, 28 July 2018 (UTC)

I removed the wrong xrefs. Yes, I think some of the Wikipedia articles should probably go with alcohol and health (Q11290178), but I don't speak those languages!! Gstupp (talk) 19:30, 30 July 2018 (UTC)

Something blew up Saturday - huge number of "subclass of self" entries now!?

See Wikidata:WikiProject Ontology/Problems/subclass of self. These seem to be based on a huge number of edits by ProteinBoxBot and KrBot, can you track down what happened? Bad merges of some sort? ArthurPSmith (talk) 12:47, 30 July 2018 (UTC)

Some of them may be because of the multiple (invalid) MonDO ID (P5270) statements. --Okkn (talk) 13:05, 30 July 2018 (UTC)
Seems to be fixed now - thanks! ArthurPSmith (talk) 17:10, 31 July 2018 (UTC)
Hi ArthurPSmith. I am adding disease subclass statements and cross-references from both Disease Ontology (DO) and Monarch Disease Ontology (MONDO), which are both disease ontologies with different but complementary methods for classifying diseases. Both of these ontologies may classify diseases in different ways and so there may be differences in the subclass structure between them. There was a bug in the code for determining if a MONDO class should be merged into an existing wikidata item that affected ~500 of the ~20k diseases. I reverted all of those edits (as you just saw). I'm working on fixing them now. Gstupp (talk) 18:35, 31 July 2018 (UTC)

Another MONDO issue

skin disease (Q949302) was recently made a subclass of rare skin disease (Q55788696), but that's clearly in the wrong direction. It looks like another issue with the identifiers: MONDO:0019043 and Orphanet 68346 appear to be for "rare genetic skin disease", not for any generic skin disorder. Perhaps some new items need to be created for this? ArthurPSmith (talk) 14:19, 2 August 2018 (UTC)

  Done Ok, I cleaned these up! Thanks for pointing it out Gstupp (talk) 22:18, 2 August 2018 (UTC)
This edit is still wrong. --Okkn (talk) 12:26, 4 August 2018 (UTC)

Chromosome values

Thank you, as always, for maintaining so much data. Today I found some value type violations for chromosome (P1057) at Wikidata:Database reports/Constraint violations/P1057. Although I could fix the data manually myself, I know that your team periodically updates it, so to avoid flip-flopping edits I am reporting it here.

Thanks! --Was a bee (talk) 08:25, 4 August 2018 (UTC)

More loops

(1) Problem with narcolepsy (Q189561) from this set of edits - possibly we need a separate item for Gélineau disease? and (2) with Emery-Dreifuss muscular dystrophy (Q1335642) from this set of edits - again possibly EDMD2 should have its own item? ArthurPSmith (talk) 14:25, 8 August 2018 (UTC)

Ok, split up. There were issues with the xrefs! Thanks Gstupp (talk) 20:18, 8 August 2018 (UTC)



You have created many duplicates, for example Q55015731 for Q19001335. This seems to come from (un)deprecated DOIDs[2]? Can you do something to merge them?

A suggestion for the future: there are already so many disease items in Wikidata that the chances of creating a duplicate are high. Maybe it would be better to use the Mix'n'match tool instead of directly creating Wikidata items for new DOIDs?

Ske (talk) 09:30, 17 August 2018 (UTC)

Ske, do you have an idea of how many duplicates there are? Gstupp (talk) 21:15, 23 August 2018 (UTC)
All right, I've merged over 1000 diseases. See log. Gstupp (talk) 22:37, 4 September 2018 (UTC)

Active ingredient

Hi. As you may know, the actual ingredients of drugs are often salts. For example, the active ingredient of Allegra (Q48828913) should be fexofenadine hydrochloride (Q27255526) [3], although currently the has active ingredient (P3781) value of Allegra (Q48828913) is fexofenadine (Q415122). At this time we don't have a way to link fexofenadine (Q415122) and fexofenadine hydrochloride (Q27255526), cefazolin (Q415739) and cefazolin sodium (Q27106104), etc., so we may have to create new properties and organize their relations. In that case, is it possible for your bot to distinguish a chemical compound from its salts? Do the data sources you are importing from correctly distinguish them? Or do you have a good plan to deal with this problem? Many thanks, --Okkn (talk) 15:04, 23 August 2018 (UTC)

Okkn, yes, we're aware of this but haven't really implemented a solution. When we imported the products and active ingredients, the decision was made to use the chemical itself without the salt, so that different products with the same active ingredient but different salt forms would all still be linked to the same chemical. In the future, there could be a property "precise active ingredient" (or something) that would link to the specific salt form. This is similar to the way it's done in RxNorm (e.g. Prozac -> "Tradename of" -> Fluoxetine, and "Has precise ingredient" -> Fluoxetine Hydrochloride). We could then also have "Fluoxetine Hydrochloride" -> "Form of" -> "Fluoxetine" (which is how it's done in RxNorm (link)). As many of these compounds have RxNorm CUIs, we could implement a solution like this. As of right now, this is lower priority on our end, but I'd be happy to work with you on proposing some properties and getting this started. Gstupp (talk) 21:14, 23 August 2018 (UTC)
@Gstupp: I'm glad to hear that. A "precise active ingredient" property may work, but to eliminate the problem totally, we may have to distinguish the umbrella term (drug) from the single concept (unique substance). When we talk about fluoxetine (Q422244) as a drug, that does not only refer to the substance whose chemical formula (P274) is "C₁₇H₁₈F₃NO", but also to fluoxetine hydrochloride (Q27280620) ("C₁₇H₁₉ClF₃NO") and other salts. KEGG, for example, has "Fluoxetine (DG00942)" as a "Chemical DGroup" (Chemical structure group?), and both "Fluoxetine (D00326)" and "Fluoxetine hydrochloride (D00823)" are members of it. How about introducing this kind of group concept, and moving some properties such as active ingredient in (P3780) and medical condition treated (P2175) from fluoxetine (Q422244) to this new item? --Okkn (talk) 07:31, 24 August 2018 (UTC)


authority (P797) is a wrong qualifier for this [4] as it's not something related to politics or any executive authority. approved by (P790) or maybe some other properties seem much better here. Wostr (talk) 23:49, 23 October 2018 (UTC)

Hi Wostr, Thanks for pointing this out. I don't think that authority (P797) is necessarily wrong though. The FDA is an agency with executive authority. This qualifier has also been used on EMA approved drugs for the past year link, so I worry about changing them all. What do others think? Gstupp (talk) 18:09, 24 October 2018 (UTC)
What I see from many labels and descriptions, and also from the statements on this property, is that P797 is reserved for politics or organisations (i.e. for governing body (Q5588651)). For uses like this one, approved by (P790), maintained by (P126) etc. are used. P790 seems the most appropriate here. The French label or the broad English label of P797 may seem okay at first glance, but in some languages the labels are correct (in relation to governing body (Q5588651)), cannot easily be broadened to match the English label, and are quite nonsensical in cases like this. Also, using such non-standard qualifiers can make re-use of this data more difficult in the future. Best, Wostr (talk) 18:50, 24 October 2018 (UTC)

Neoplasm is not a subclass of benign neoplasm

Many types of neoplasms are incorrectly stated as a subclass of benign neoplasm. See anus neoplasm, for example. Mahdimoqri (talk) 05:10, 18 November 2018 (UTC)

I made an issue here. Thanks Gstupp (talk) 18:48, 19 November 2018 (UTC)

Why mark GO terms as deprecated?

Example. It's the only item with that GO term, so now SPARQL doesn't (by default) find any items with that GO term, breaking my scripts. Any specific reason for deprecating that GO term? If not, please revert this edit, and all other GO term deprecations. --Magnus Manske (talk) 16:26, 6 February 2019 (UTC)

Hi @Magnus Manske: we add the deprecated rank here on WD when the term is marked as "obsolete" by the Gene Ontology consortium. For example, here is the GO page for single organismal cell-cell adhesion (Q14863396): We left it as deprecated instead of deleting the WD item so that people have a record that it used to be a valid GO term. This still seems like the best behavior to me, but certainly open to discussion if you feel differently... (What's a bit more confusing to me is why annotations to deprecated GO terms exist -- we will investigate.) Best, Andrew Su (talk) 17:52, 6 February 2019 (UTC)
Thanks, that makes sense! --Magnus Manske (talk) 08:31, 7 February 2019 (UTC)
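For anyone whose scripts are affected by the deprecated rank, here is a minimal sketch, assuming the GO term is stored with Gene Ontology ID (P686). The `wdt:` prefix only matches "truthy" statements, which never include deprecated ones, while the `p:`/`ps:` path matches statements of every rank. The function name `go_item_query` and the placeholder ID `GO:0000000` are made up for illustration.

```python
def go_item_query(go_id: str, include_deprecated: bool = False) -> str:
    """Build a SPARQL query for items carrying a given Gene Ontology ID (P686)."""
    if include_deprecated:
        # p:/ps: walks the full statement node, so deprecated-rank
        # statements are matched too; the rank is returned for inspection.
        pattern = (
            "?item p:P686 ?statement .\n"
            f'  ?statement ps:P686 "{go_id}" ;\n'
            "             wikibase:rank ?rank ."
        )
        select = "?item ?rank"
    else:
        # wdt: only sees truthy statements, so deprecated ones are skipped.
        pattern = f'?item wdt:P686 "{go_id}" .'
        select = "?item"
    return f"SELECT {select} WHERE {{\n  {pattern}\n}}"


# "GO:0000000" is a placeholder, not a real term.
print(go_item_query("GO:0000000", include_deprecated=True))
```

The resulting string can be pasted into the query service. Filtering on `?rank` afterwards lets a script tell obsolete terms apart from current ones.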

Accidental GO term removals

I'm afraid my bot didn't check properly before removing some GO terms (example) that you had added. I have changed the code to not remove any GO terms that have a curator (P1640) that is not GeneDB (Q5531047), but some damage was done. I am trying to have the changes for that species reverted, but there may be others. Feel free to add them again, my bot should respect them next time! --Magnus Manske (talk) 15:10, 1 March 2019 (UTC)

I think those would be automatically added back in on the next bot run, but we'll keep an eye on it and confirm... Best, Andrew Su (talk) 17:54, 1 March 2019 (UTC)

Once again, a bug in my bot code has caused the removal of your GO terms from some items with GeneDB ID (P3382), for found in taxon (P703):Plasmodium falciparum 3D7 (Q61779043) (example edit). The bug is fixed now, but it is probably best to wait for your bot to re-add them. --Magnus Manske (talk) 11:54, 27 June 2019 (UTC)
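The removal rule described in this thread can be sketched roughly as follows. This is not the actual bot code: `may_remove` and the flat list of curator IDs are hypothetical simplifications of how curator (P1640) values would be read from a statement. Under this literal reading, a statement with no curator at all counts as removable; a more cautious bot might skip those too.

```python
from typing import Iterable

GENEDB = "Q5531047"  # GeneDB, as named in the discussion above


def may_remove(curators: Iterable[str]) -> bool:
    """Return True only if no curator other than GeneDB is recorded.

    `curators` is a hypothetical, simplified stand-in for the curator (P1640)
    values attached to one GO statement.
    """
    return all(curator == GENEDB for curator in curators)


# "Q_OTHER_CURATOR" is a placeholder, not a real item ID.
statements = [
    {"id": "statement-1", "curators": [GENEDB]},
    {"id": "statement-2", "curators": [GENEDB, "Q_OTHER_CURATOR"]},
]
removable = [s["id"] for s in statements if may_remove(s["curators"])]
print(removable)
```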

Possibly related account

Is Torogertu related to this account? Some edits (e.g. Special:Diff/874634275/915402367) indicate to me that perhaps the user account is being misused, but I don't really know what to make of it. Jc86035 (talk) 12:20, 16 April 2019 (UTC)

Hi Jc86035, I'm working on the disease ontology bot and I may have accidentally let it run wild. I'll clean it up. Torogertu (talk) 15:25, 16 April 2019 (UTC)
... and just to add a tiny bit more detail, yes, Torogertu is a new member of the team running this ProteinBoxBot account. He was doing some test edits on his user account to prototype an enhancement, but then accidentally forgot to set the 'test' flag that would have prevented the actual write. Thank you for the heads up, and again, we'll work on fixing things asap... Best, Andrew Su (talk) 15:53, 16 April 2019 (UTC)
Jc86035, thanks again for catching this. I went back to the code and realized I git-pulled the master bot instead of the branch bot where I made all my edits (including the 'test' flag). I double checked that I made no edits to the master bot prior to the test-run, and found no changes. I believe that if PBB were to be run again, it would write the same things that I had written on my user account. I confirmed this by looking through a couple dozen of my edits. As all the edits I observed would probably be written by PBB in the future, I don't plan on removing the edits. I'll be more careful next time. Torogertu (talk) 04:54, 17 April 2019 (UTC)
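As an illustration of the kind of 'test' flag mentioned above, here is a minimal dry-run sketch. It is not how ProteinBoxBot actually implements it: `write_edits`, `write_fn` and the `test` parameter are all hypothetical names. The flag defaults to test mode, so forgetting to set it results in no writes rather than live edits.

```python
def write_edits(edits, write_fn, test=True):
    """Apply edits via write_fn unless running in test mode.

    Returns the list of edits that were actually written.
    """
    written = []
    for edit in edits:
        if test:
            # Dry run: report what would happen, but never call write_fn.
            print(f"[test] would write: {edit}")
            continue
        write_fn(edit)
        written.append(edit)
    return written


# Usage with a stand-in writer that just records what it receives.
log = []
write_edits(["edit-1", "edit-2"], log.append)              # nothing written
write_edits(["edit-1", "edit-2"], log.append, test=False)  # both written
```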

another subclass loop

This edit created a subclass loop between cerebellar ataxia (Q154709) and hereditary ataxia (Q21082497). I'm assuming there's a problem with one or more of the identifiers, or is this loop actually in your source reference? ArthurPSmith (talk) 16:08, 20 April 2019 (UTC)

@ArthurPSmith: Hmm, interesting example. It appears that the subclass loop is due to the fact that two data sources Disease Ontology release 2019-04-18 (Q63226230) and Monarch Disease Ontology release 2018-06-29sonu (Q55345445) disagree on the direction of that subclass relationship. I don't have the expertise to judge which is correct, and that's certainly not a call that I'd want our bot to make automatically. Given that Wikidata is a database of assertions and not facts, it seems like we want to allow for capturing this type of disagreement. Of course, I understand that this complicates usage of Wikidata by reasoners. Do you have a suggestion on how this could be better modeled? Or is the reference something that reasoners just need to account for? Best, Andrew Su (talk) 22:46, 22 April 2019 (UTC)
If it's really in the sources that's probably fine - but we may want to notify them about the disagreement. Reasoners will have to deal with stuff like that I guess! ArthurPSmith (talk) 11:33, 23 April 2019 (UTC)
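For reasoners or bots that want to spot this kind of loop before relying on the hierarchy, here is a minimal sketch of a depth-first check over "subclass of" pairs. `find_subclass_loop` is a hypothetical helper, and the edge list simply restates the two conflicting statements from this thread without saying which source asserts which direction.

```python
def find_subclass_loop(edges):
    """Return one loop as a list of items, or None if there is no loop.

    `edges` is an iterable of (child, parent) pairs, each standing for one
    "subclass of" statement.
    """
    parents = {}
    for child, parent in edges:
        parents.setdefault(child, []).append(parent)

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    state = {}
    path = []

    def visit(node):
        state[node] = GREY
        path.append(node)
        for parent in parents.get(node, []):
            if state.get(parent, WHITE) == GREY:
                # Reached an item already on the current path: a loop.
                return path[path.index(parent):] + [parent]
            if state.get(parent, WHITE) == WHITE:
                loop = visit(parent)
                if loop:
                    return loop
        path.pop()
        state[node] = BLACK
        return None

    for node in list(parents):
        if state.get(node, WHITE) == WHITE:
            loop = visit(node)
            if loop:
                return loop
    return None


edges = [
    ("Q154709", "Q21082497"),  # cerebellar ataxia subclass of hereditary ataxia
    ("Q21082497", "Q154709"),  # hereditary ataxia subclass of cerebellar ataxia
]
print(find_subclass_loop(edges))
```

The recursion is fine for small examples; a very deep hierarchy would need an iterative version.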

RfC about enzymes

There's currently an RfC about bot created items for enzymes: ChristianKl❫ 07:49, 19 June 2019 (UTC)
