Merge?

Hi,

Can we merge Q22274764 and fermentation (Q41760) ?

Thank you. Tubezlob (🙋) 16:40, 6 July 2017 (UTC)

Done, thanks! Gstupp (talk) 17:20, 6 July 2017 (UTC)

Descriptions

Looking for descriptions starting with "a ..", I came across a few generated by this bot (sample correction). For future runs, maybe the initial "a", underscores in text and final dots can be removed directly.
--- Jura 14:16, 7 September 2017 (UTC)

Hi Jura, I've made an issue on our github to address this. Gstupp (talk) 23:30, 7 September 2017 (UTC)
Is there an official style guide for descriptions? The suggestions seem reasonable, but unless they were part of an official style guide I think keeping the status quo could be convincingly argued as well (to keep in sync with the source databases). Thoughts? Andrew Su (talk) 00:23, 8 September 2017 (UTC)
We have one: It's Help:Descriptions. --Izno (talk) 20:02, 8 September 2017 (UTC)
Thanks for the link, very helpful... Best, Andrew Su (talk) 20:28, 8 September 2017 (UTC)
  • Maybe I need to ping the operators as well @Andrawaag, Sebotic:.
    --- Jura 07:26, 8 September 2017 (UTC)
Both Andrew and Gstupp are also part of the team maintaining this bot. I concur with Andrew that maintaining the descriptions as they are in the original source makes sense. --Andrawaag (talk) 07:45, 8 September 2017 (UTC)
It would be good to have an operator for this bot who is knowledgeable about Wikidata rules. We already had to block it in the past, and you should make sure that this does not recur.
--- Jura 08:03, 8 September 2017 (UTC)
I've updated the bot to take care of this: https://github.com/SuLab/GeneWikiCentral/issues/62 Gstupp (talk) 23:08, 2 October 2017 (UTC)
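The cleanup suggested in this thread (dropping a leading "a", replacing underscores, and removing final dots) could look roughly like the sketch below. This is only an illustration, not the bot's actual code: the function name `clean_description` is hypothetical, and treating "an" the same way as "a" is an added assumption.

```python
import re


def clean_description(text: str) -> str:
    """Normalize a source description (hypothetical helper, not the bot's code)."""
    # Underscores in the source text become spaces.
    cleaned = text.replace("_", " ").strip()
    # Drop a leading indefinite article; handling "an" as well is an assumption.
    cleaned = re.sub(r"^(?:an?)\s+", "", cleaned, flags=re.IGNORECASE)
    # Drop trailing full stops and any whitespace left in front of them.
    return cleaned.rstrip(".").rstrip()
```

Because the article must be followed by whitespace to match, a description such as "anemia of chronic disease" is left unchanged.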

ICD-10-CM and ICD-10 codes are not necessarily interchangeable.

I've reverted your change of the ICD-10 code over at REM sleep behavior disorder (Q2103933). The code G47.52 is not found in ICD-10. This can be double-checked by searching for it on the online version of ICD-10.[1] It is in ICD-10-CM, but that now has its own property: ICD-10-CM (P4229) Little pob (talk) 20:17, 23 September 2017 (UTC)

I've edited the bot to use ICD-10-CM and switched over codes to the appropriate system (as determined by the Disease Ontology). Gstupp (talk) 23:29, 25 September 2017 (UTC)

Cerebrovascular disease and stroke

Hi! The en label of stroke (Q12202) should be "stroke", according to enwiki, and DOID:6713 (cerebrovascular disease) should be linked to cerebrovascular disease (Q3010352), instead of stroke (Q12202). Thanks. --Okkn (talk) 14:28, 22 October 2017 (UTC)

Hi Okkn. I think I've cleaned up the items and made an issue on DO. Gstupp (talk) 18:11, 2 November 2017 (UTC)

Thank you, Gstupp! --Okkn (talk) 10:18, 3 November 2017 (UTC)

Help with some ontology issues in diseases?

I've been cleaning up some ontology issues in Wikidata, but I'm struggling to figure out what to do with diseases. One set of problems is cycles: x subclass of y subclass of x. The list of remaining cases is here, and you'll see that all the ones left there are in the area of diseases. Looking at them, the relationships seem to be based on the Disease Ontology in ways that don't make sense. For example, aseptic meningitis (Q4804182) is stated as a subclass of viral meningitis (Q3301664) based on the DOIDs, but logically the subclass relationship should be the other way round, if those definitions are correct (viruses are a subset of non-bacterial causes). Is this a problem in the Disease Ontology, or with the relationships that have been entered here? We really need an expert or two to help out on this. If you aren't able to do it, I'd appreciate it if you could point to somebody who can help. Thanks! ArthurPSmith (talk) 15:24, 2 November 2017 (UTC)

Thanks ArthurPSmith, I will look into this. Gstupp (talk) 18:12, 5 November 2017 (UTC)
I've gone through and fixed some of the ones that were added by users (not from Disease Ontology) and are wrong. There are others, like the virus one, that are possibly issues with DO, but are confusing to me. I've shared this with the DO team and will update. Gstupp (talk) 00:01, 8 November 2017 (UTC)
Thanks! There's also one three-level cycle in diseases that maybe you could check out too: Wikidata:WikiProject Ontology/Problems/3rd-order subclass of self. I was trying to sort it out myself, but got very confused about what was going on across the different language Wikipedias regarding macular degeneration. ArthurPSmith (talk) 16:30, 9 November 2017 (UTC)
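For readers who want to check for the cycles discussed in this thread offline, here is a minimal sketch that works on a plain list of (child, parent) subclass pairs. The function name `nodes_on_cycles` is hypothetical and the edge list in the usage note is toy data; it is not how the WikiProject Ontology reports are actually generated.

```python
from collections import defaultdict


def nodes_on_cycles(edges):
    """Return every item that is, directly or transitively, a subclass of itself."""
    parents = defaultdict(set)
    for child, parent in edges:
        parents[child].add(parent)

    def reaches_itself(start):
        # Iterative walk up the subclass chain, looking for the start item again.
        stack, seen = list(parents[start]), set()
        while stack:
            node = stack.pop()
            if node == start:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(parents.get(node, ()))
        return False

    return {node for node in list(parents) if reaches_itself(node)}
```

With the pair from the example above entered in both directions, `nodes_on_cycles([("Q4804182", "Q3301664"), ("Q3301664", "Q4804182")])` flags both items. The same walk also catches three-level cycles and direct self-loops.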

“remove deprecated statements”

You changed some items with the edit summary “remove deprecated statements” (example). What does that mean, and what shall we do with the remaining, almost empty items? —MisterSynergy (talk) 08:17, 3 November 2017 (UTC)

Hello MisterSynergy, Thanks for pointing this out. This is the result of inconsistencies in the external IDs between Robinow Syndrome and its subclasses. The DO bot failed to update the statements on this item because the external IDs conflicted with the external IDs in another item. You can see the error in the log file row 1337.
I've added back the DO ID on this, so when the bot runs again it will re-add the current statements. I've also created an issue in DO to fix the IDs.
There are 34 other items in the log that probably also have this issue, so I'll take a look at them as well.
Gstupp (talk) 00:14, 8 November 2017 (UTC)

Ok, I've fixed up the others. Found using this sparql query: link Gstupp (talk) 23:29, 8 November 2017 (UTC)

Thanks, looks good indeed! MisterSynergy (talk) 08:54, 9 November 2017 (UTC)

has listed ingredient (P4543) and drugs

We imported a lot of data about drugs from OpenFDA. As far as I remember, the reason we only store has active ingredient (P3781) was that at the time we had no good property for the other ingredients. ChristianKl () 19:46, 22 November 2017 (UTC)

Hi ChristianKl, Most of the active ingredient information actually came from the EMA. (Example) The inactive ingredients are sadly not available in a structured form (that I can find). Same story with OpenFDA. It looks like many drug labels have brand name and active ingredients (along with UNII, so we don't have to string match!!), but no other ingredients. They only exist in the free text package labeling and would be a lot of work to pull out and normalize. In addition, the indications are not structured in any way, which is why we grabbed them from EMA. See the openFDA field in this. Gstupp (talk) 20:02, 22 November 2017 (UTC)

This bot created a subclass loop!

Preprotein translocase subunit SecE (Q24738466) was just made a subclass of Protein translocase SEC61 complex, gamma subunit (Q24768152), which was just made a subclass of Preprotein translocase subunit SecE (Q24738466)! Something's gone wrong there.

Also, any progress on resolving the remaining subclass loops in diseases? See Wikidata:WikiProject Ontology/Problems/subclass of subclass of self - thanks! ArthurPSmith (talk) 15:02, 28 November 2017 (UTC)

Hello Arthur. You're too fast! The bot was in the middle of a bot run and hadn't finished updating all items when you posted. The run has completed and I don't see any loops.

I submitted two issues for the remaining subclass loops 1 2 Gstupp (talk) 20:27, 28 November 2017 (UTC)

@Gstupp: yes, it looks better, thanks, and thanks for posting those issues! ArthurPSmith (talk) 21:33, 28 November 2017 (UTC)

cell (Q7868) subclass of (P279) cellular component (Q5058355)?

Cell is a part of a cell? --Fractaler (talk) 07:22, 20 December 2017 (UTC)

Yes. According to the reference: Gene Ontology, cell is a cellular_component. Subclass does not mean "part of". Gstupp (talk) 18:22, 20 December 2017 (UTC)

Now cellular component (Q5058355) has the description "part of a cell". Right? Fractaler (talk) 06:07, 21 December 2017 (UTC)

Do not remove statements automatically created by other users

Hi. I noticed that ProteinBoxBot removes instance of (P31), subclass of (P279) and has part (P527) statements created by other users. (ex. https://www.wikidata.org/w/index.php?title=Q7868&type=revision&diff=625609569&oldid=623112989 https://www.wikidata.org/w/index.php?title=Q187082&diff=prev&oldid=625594775 https://www.wikidata.org/w/index.php?title=Q177708&diff=prev&oldid=625602294 https://www.wikidata.org/w/index.php?title=Q582477&diff=prev&oldid=625596085 https://www.wikidata.org/w/index.php?title=Q79899&diff=prev&oldid=625609728). I think they should not be removed automatically even if they are not defined in Gene Ontology. --Okkn (talk) 05:15, 2 February 2018 (UTC)

@Okkn: Thanks for bringing this to our attention -- those are definitely unintentional changes. The primary person to look at this is out of the office at the moment, but we'll get this addressed early next week. Apologies, and thanks again for the bug report! Best, Andrew Su (talk) 17:09, 2 February 2018 (UTC)
Hi @Okkn:, I've implemented the changes here. I'll look into seeing if I can revert what was overwritten. Thanks for pointing this out. Gstupp (talk) 20:23, 6 February 2018 (UTC)

To prevent items from being P31 and P279* of the same class

I think using instance of (P31) to indicate the semantic type of the item is useful to query. On the other hand, ProteinBoxBot creates lots of instance of (P31) statements which results in having instance of (P31) and subclass of (P279)* of the same class (such as disease (Q12136), gene (Q7187), biological process (Q2996394), cellular component (Q5058355) and molecular function (Q14860489)), and some ontologists regard this as a problem.

To resolve this problem, I propose that we create new metaclasses (first-order metaclass (Q24017414)), like type of sport (Q31629), cell type (Q189118) or type of mathematical function (Q47279819), corresponding to the above classes, and that we replace disease (Q12136), gene (Q7187) etc. in instance of (P31) statements with the metaclasses ("disease class" or "type of gene"). --Okkn (talk) 06:08, 4 February 2018 (UTC)
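To get a feel for how many items are affected, the overlap described above (an item that is both an instance of a class and a transitive subclass of that same class) can be checked with a small sketch like the following. The function names and the dictionary-based input format are assumptions made for illustration. A real check would more likely run as a query against Wikidata itself.

```python
def ancestors(item, subclass_of):
    """All classes reachable from `item` by following subclass-of links upward."""
    seen, stack = set(), list(subclass_of.get(item, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(subclass_of.get(node, ()))
    return seen


def instance_and_subclass_overlap(instance_of, subclass_of):
    """Map each item to the classes it is both an instance and a subclass* of."""
    overlap = {}
    for item, classes in instance_of.items():
        shared = set(classes) & ancestors(item, subclass_of)
        if shared:
            overlap[item] = shared
    return overlap
```

Both arguments are plain dictionaries from an item ID to a set of class IDs. Items whose instance-of classes never appear among their ancestors are simply left out of the result.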

Subclass of self is just wrong!

ProteinBoxBot has been making a number of edits like this one that assert that something is a subclass of itself. This is meaningless. I will remove them, but please ensure they don't recur. ArthurPSmith (talk) 21:25, 27 February 2018 (UTC)

Same thing for instance of (P31) (I just removed 3 just now on cellular component (Q5058355), molecular function (Q14860489) and biological process (Q2996394)). Cdlt, VIGNERON (talk) 09:09, 11 May 2018 (UTC)
Thanks for pointing these out Gstupp (talk) 19:39, 11 May 2018 (UTC)

Redundant aliases

In edits like this one, the bot seems to be adding "Name (disorder)" as an alias (complete with unwanted capitalization). Even if that's in a source, it's probably not correct to be adding the source's disambiguator to the Wikidata record.

Also, for edits such as this one, is there a way to tell it to stop adding aliases after they've been corrected? Abbreviations such as "acute/subac." should be spelled out, and the abbreviated version shouldn't be used at all.

(Please ping me if you have any questions.) WhatamIdoing (talk) 22:30, 5 March 2018 (UTC)

Hi WhatamIdoing. Yes, these "Name (disorder)" aliases are present in the source (DO). I agree it's not useful to have this. I'll edit the bot to filter those out. Are there any others you noticed? I see "morphologic abnormality" and "finding" as well.

For the second issue, this is a lot harder to address. I have no way of knowing that the abbreviated version was corrected without getting the full history of every item... I can filter "acute/subac." out as well. But I only see one item containing "subac.". Have you seen other abbreviations? Gstupp (talk) 00:33, 6 March 2018 (UTC)

I think that the ideal behavior is to make sure that "Name" (or "name") is present, and if not, to add the name without the (disorder) appended. (I don't know if that's easy to code, though.)
I think that "NOS" is the most common abbreviation, and it's probably just as irrelevant as (disorder), but as it theoretically contains some content ("not otherwise specified"), I have slightly more sympathy for it. WhatamIdoing (talk) 06:26, 6 March 2018 (UTC)
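The behavior discussed in this thread (strip the source's disambiguator, then add the bare name only if it is not already present) could be sketched as below. The suffix list contains just the three disambiguators mentioned above, and the function name `clean_aliases` is hypothetical. This is not the bot's actual filter.

```python
import re

# Disambiguators mentioned in this thread; a real list would likely be longer.
SUFFIXES = ("disorder", "finding", "morphologic abnormality")
_SUFFIX_RE = re.compile(
    r"\s*\((?:%s)\)\s*$" % "|".join(map(re.escape, SUFFIXES)), re.IGNORECASE
)


def clean_aliases(label, aliases):
    """Strip source disambiguators and drop aliases that duplicate the label."""
    seen = {label.casefold()}
    result = []
    for alias in aliases:
        stripped = _SUFFIX_RE.sub("", alias).strip()
        key = stripped.casefold()
        # Comparing case-insensitively also avoids re-adding "Name" next to "name".
        if stripped and key not in seen:
            seen.add(key)
            result.append(stripped)
    return result
```

Abbreviations such as "NOS" or "acute/subac." are deliberately not handled here, since the thread leaves open how they should be treated.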

More subclass loops from imported ontologies

We now have regulation of leucine import across plasma membrane (Q27303095) subclass of regulation of leucine import (Q22303228), which is in turn a subclass of regulation of leucine import across plasma membrane (Q27303095) thanks to your recent edits - in this case both relations are referenced to "Gene Ontology", the first dated 9 May 2017, and the second dated 6 March 2018. The new relation doesn't make sense to me given the labels - maybe something's gone wrong with the ID's? And we now also have lactic acidosis (Q1500373) subclass of metabolic acidosis (Q1598200), which is in turn a subclass of lactic acidosis (Q1500373), this time both sourced to "Disease Ontology" releases, the first from 5 March 2018 and the second from 5 December 2017. Has this source reversed this relation in the last few months? In any case, assuming the source does not contain both statements now, the one that is no longer "valid" should at least be deprecated (I think I would prefer it to be removed altogether, but maybe recording the old version of the relation is useful for some purpose). ArthurPSmith (talk) 14:42, 6 March 2018 (UTC)

@ArthurPSmith: Metabolic acidosis was fixed about a week ago. https://github.com/DiseaseOntology/HumanDiseaseOntology/issues/472 --Okkn (talk) 05:26, 7 March 2018 (UTC)
The disease ones should be fixed. For the GO ones, I made an issue and am working on this. Gstupp (talk) 19:44, 14 March 2018 (UTC)
That was me I think! I've been working on mesh items and accidentally merged coloring agents and food coloring. Thanks for fixing it. Gstupp (talk) 19:44, 14 March 2018 (UTC)

aromatase: Wikipedia links moved from the enzyme to the gene

Hi,

It seems ProteinBoxBot really wants the Wikipedia articles to be linked by the gene element rather than the enzyme:

I've added them back (and the commons category) to the enzyme, but won't ProteinBoxBot move them again? 06:10, 21 March 2018 (UTC)

User:The_RedBurn, This article (and indeed most articles about human genes/proteins) is about both the gene and the protein, and it is more consistent to have all of these articles linked to the gene Wikidata item. Additionally, the infobox on the Wikipedia page requires that the page be linked to the gene item and not the protein. As of right now, the infobox is displaying NA. Gstupp (talk) 18:16, 21 March 2018 (UTC)
It seems rather strange to link the articles (which all seem to mainly talk about the enzymes (and then about their coding genes)) to the genes instead of the enzymes. The drawback of doing so is that the Wikipedia mobile apps and VisualEditor describe those enzymes as genes, which may confuse the user. Is there any other reason than "most/all of these articles are linked to the gene wikidata item"? About the infobox, that's just a technical detail, I've fixed it for now. The RedBurn (ϕ) 19:05, 21 March 2018 (UTC)
The_RedBurn: I agree that articles talking mostly about proteins should be linked on the protein item, and I'm fixing this manually where I can. If you state that these articles are about gene+protein then you need a concept for that to model it on WD, and then create instances of that concept that have-part gene and protein items and then you can put sitelink on the concept instance. --SCIdude (talk) 09:10, 1 September 2019 (UTC)

Trying to clean up NCI Thesaurus cross-references before Cellosaurus wikipedia bot starts working

I am trying to add many new disease terms, mainly in 3 categories: 1) animal disease terms (which are not problematic for ProteinBoxBot, as you do not have them in DO), 2) cancer terms (mostly children of existing ones, so here again there are no problems), 3) genetic disease terms (and here there are many problems).

The problems are that the mappings of Disease Ontology to NCI Thesaurus 1) are incomplete, 2) are partially wrong, and 3) somehow do not go to the right specific Wikidata entry. Example: for neurofibromatosis (Q847605), the bot wants to add C3273, which I have added to the correct Wikidata entry, neurofibromatosis type I (Q7616509); similarly, C3274 was added to neurofibromatosis type II (Q1935832).

How do we go forward to correct all these errors and inconsistencies? --Amb sib (talk) 20:07, 20 May 2018 (UTC)

Removing valid MeSH ID as deprecated statements

Hi, ProteinBoxBot has removed valid MeSH IDs and other statements:

You should undo them. Regards, --Okkn (talk) 07:45, 26 June 2018 (UTC)

It looks like these were removed in DO and then added back in for some reason. There was a new release this morning with them added back, so the bot should re-add them... Gstupp (talk) 20:01, 26 June 2018 (UTC)

The bot also removed ICD-9-CM (P1692), for example this edit. ICD-9 is old, but despite the recent publication of ICD-11, it is still in common use and is not deprecated. Please roll back these erroneous removals. --RexxS (talk) 22:28, 26 June 2018 (UTC)

Ya, that's a good question.. Thanks for bringing this up. They were removed by DO and so were removed by the DO bot. I made an issue on their issue tracker: https://github.com/DiseaseOntology/HumanDiseaseOntology/issues/532. Gstupp (talk) 23:07, 26 June 2018 (UTC)
Do you realise that the bot is not only removing valid information from Wikidata but also from all the Wikipedias who derive their ICD9 information from Wikidata? Wikidata is not Disease Ontology and we should not be hostage to their miscalculations. Bot operators are responsible for the edits made by their bots and must take responsibility for correcting their errors. Is it necessary to ask for administrator assistance to have these mistakes rolled back? --RexxS (talk) 19:18, 27 June 2018 (UTC)
The goal of the DO bot is to accurately reflect what DO says. It is not a disease bot, it's a DO bot. If there are non-DO references for a statement (or no references at all), the bot will leave it alone. The bot only removed statements whose only reference was DO. Course of action: 1) I don't know why they were removed from DO, but they should be added back. 2) We are adding disease info from another source (MONDO), and so will have two sources of information. 3) In a broader sense, I think we should strive to have multiple independent sources of information (with references for each). In this way, the trustworthiness of a statement can be better assessed, and as a byproduct, the impact of sources going rogue is better mitigated. Gstupp (talk) 20:47, 27 June 2018 (UTC)
When it comes to these identifiers the reference is in the code itself. The code is the reference to the ICD handbook, so DO is entirely irrelevant here. This is a major issue and the idea of using DO as a reference is faulty. DO is not a reference, it is a directory of related links and queries. DO may be used as an additional link, but these codes are not unreferenced. I think that applies to all DO-referenced codes, so you'll have to stop doing that entirely.CFCF (talk) 14:16, 28 June 2018 (UTC)
On the contrary, DO is relevant here. The source for the claim came from DO in the first place. As far as I know, there hasn't been any effort to get the ICD handbook into Wikidata. So when you are looking at these disease statements, you do see ICD codes if they exist as mappings in the Disease Ontology. If the Disease Ontology no longer substantiates those claims, they need to be removed, since the reference is no longer accurate. The solution here would be for the ICD handbook in its entirety to be added to Wikidata, or for other resources to add mappings to ICD codes. --Andrawaag (talk) 15:15, 28 June 2018 (UTC)
I agree. I think if there were statements that had a second reference (to the ICD handbook or some other source), then the bot should just delete the DO reference. In these cases where DO was the only reference stated, when/if the latest version of DO ceases to make those statements then they should be removed from Wikidata. So I think the bot is working correctly here (and that I hope the DO team restores those links ASAP, per the github issue). Best, Andrew Su (talk) 15:59, 28 June 2018 (UTC)
In gout (Q133087), for example, this bot removed ICD-9 codes, because the statements had only DO references. However, gout (Q133087) had already had ICD-9 codes before this bot began to update this item (https://www.wikidata.org/w/index.php?title=Q133087&oldid=67703678). Is it really correct? --Okkn (talk) 17:49, 28 June 2018 (UTC)
The algorithm you are using to decide to remove a statement is faulty. At present you decide unilaterally that the link to DO is the only reference for an ICD-9. That is false because the ICD-9 is a reference for itself, as is the case for many identifiers - in other words, anyone can verify that the ICD-9 code is accurate for the given entity simply by following the link constructed by the property. In other words, it doesn't matter whether the ICD-9 code exists in DO or not, it is still verifiable without any reference to DO. Now, please stop removing accurate information from the database simply because you have faulty code in your bot. --RexxS (talk) 21:40, 30 June 2018 (UTC)

To follow back up on this. At this time, the xrefs are back in DO and are back in Wikidata. Furthermore, I've started adding in Mondo data, and so there is a second source of information, and on this item in particular, there are ICD9 codes from both Mondo and DO.

Discussing what happened in two steps: 1) Initially (before ProteinBoxBot), the item had ICD9 statements that had been imported from English Wikipedia. The bot replaced them with the DO reference. At the time, this felt like the most reasonable thing to do, as we were not sure whether the data had been reviewed/curated by anyone, or where the data came from (before Wikipedia). In retrospect, it would have been better to leave the "imported from" English Wikipedia references. 2) After DO removed the ICD9 xrefs, the bot removed those statements from Wikidata, as it should have. An identifier is not a reference for itself. An entity can have multiple different sources for multiple conflicting sets of xrefs. While it is true that an individual can click on the link to verify whether the ID is correct, this is not the same as a primary resource, or external organization, stating that this xref is correct (and also specifying the external ID of which the xref is a cross-reference).

If Wikidata wants to be a self-contained, primary source of IDs like these, then, I think, the correct course of action would be to import the cross-references from the external resource itself, and add a reference back to that original resource. For instance, in the case of ICD9, the ICD9 entities could be matched up to the correct disease items in Wikidata and a reference added (which would include the date retrieved, or version number, etc). Alternatively, or in addition, this could be jump-started with the existing mappings. Gstupp (talk) 21:51, 3 July 2018 (UTC)
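For reference, the removal policy as the bot operators describe it in this thread can be summarized in a short sketch: a statement that the source no longer supports is dropped only if that source was its sole reference; if other references exist, only the stale reference is removed; statements without that source as a reference are left alone. Other participants dispute whether this is the right policy, so the sketch illustrates the rule and does not endorse it. The `Statement` class, the function name, and the placeholder values are all hypothetical and are not taken from the bot's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    value: str
    references: frozenset = frozenset()  # names of the sources cited


def sync_with_source(statements, source, current_values):
    """Apply the described policy for one source and return the kept statements."""
    kept = []
    for st in statements:
        stale = source in st.references and st.value not in current_values
        if stale:
            remaining = st.references - {source}
            if not remaining:
                continue  # the source was the only reference: drop the statement
            st = Statement(st.value, remaining)  # keep it, minus the stale reference
        kept.append(st)
    return kept
```

Under this rule, a statement whose original "imported from" reference had earlier been replaced by a DO reference becomes removable as soon as DO drops the xref. That is the situation described in step 1 above.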

Subclass of self? (I think due to two different DOID's)

This edit made myocardial infarction (Q12152) a subclass of itself. This seems wrong. I think the problem is caused by having two distinct Disease Ontology ID (P699) values on this item: DOID:5844 and DOID:9408. The latter is labeled "acute myocardial infarction" at the DOID website, so I think this item needs to be split into two, but I'm not a medical expert... ArthurPSmith (talk) 12:53, 26 June 2018 (UTC)

  Done @ArthurPSmith: I have undone the invalid merge, and now myocardial infarction (Q12152) and acute myocardial infarction (Q18558122) are separated. --Okkn (talk) 19:15, 26 June 2018 (UTC)
thanks! Gstupp (talk) 20:00, 26 June 2018 (UTC)
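The mechanism suspected in this thread (two ontology IDs on one item turning an ordinary subclass edge into a self-loop) can be guarded against before writing. A minimal sketch follows, assuming the ontology supplies (child, parent) ID pairs and that a DOID-to-item mapping is available. The function name is hypothetical, and the parent/child relation between the two DOIDs in the usage note is assumed for illustration.

```python
def translate_edges(doid_edges, doid_to_qid):
    """Map ontology subclass edges onto items, setting aside would-be self-loops."""
    edges, self_loops = [], []
    for child, parent in doid_edges:
        child_item = doid_to_qid.get(child)
        parent_item = doid_to_qid.get(parent)
        if child_item is None or parent_item is None:
            continue  # one end has no item yet; nothing to write
        if child_item == parent_item:
            # Both IDs sit on the same item: report it instead of writing the edge.
            self_loops.append((child, parent, child_item))
        else:
            edges.append((child_item, parent_item))
    return edges, self_loops
```

With both DOID:5844 and DOID:9408 mapped to Q12152, an edge between them is reported as a self-loop. Once DOID:9408 is mapped to Q18558122 instead, the same edge translates normally.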

location (P276) on disease items should be replaced with anatomical location (P927)

ProteinBoxBot is importing “located in” relations from Disease Ontology by using location (P276), and it causes value type violations (e.g. pancreatic cancer (Q212961), skin disease (Q949302)). For anatomical structures, anatomical location (P927) should be used instead of location (P276). --Okkn (talk) 19:39, 26 June 2018 (UTC)

Ah, that is definitely better. Will change. Thanks Gstupp (talk) 20:00, 26 June 2018 (UTC)   Done Gstupp (talk) 21:51, 3 July 2018 (UTC)

Alcoholic disorders

I think that this probably needs a manual review: https://www.wikidata.org/w/index.php?title=Q1340700&diff=0&oldid=682155292

I'm not even sure if those Wikipedia articles are all on the same subject. WhatamIdoing (talk) 19:07, 28 July 2018 (UTC)

I removed the wrong xrefs. Yes, I think some of the Wikipedia articles should probably go with alcohol-induced disorders (Q11290178), but I don't speak those languages!! Gstupp (talk) 19:30, 30 July 2018 (UTC)

Something blew up Saturday - huge number of "subclass of self" entries now!?

See Wikidata:WikiProject Ontology/Problems/subclass of self. These seem to be based on a huge number of edits by ProteinBoxBot and KrBot, can you track down what happened? Bad merges of some sort? ArthurPSmith (talk) 12:47, 30 July 2018 (UTC)

Some of them may be because of the multiple (invalid) MonDO ID (P5270) statements. --Okkn (talk) 13:05, 30 July 2018 (UTC)
Seems to be fixed now - thanks! ArthurPSmith (talk) 17:10, 31 July 2018 (UTC)
Hi ArthurPSmith. I am adding disease subclass statements and cross-references from both Disease Ontology (DO) and Monarch Disease Ontology (MONDO), which are both disease ontologies with different but complementary methods for classifying diseases. Both of these ontologies may classify diseases in different ways and so there may be differences in the subclass structure between them. There was a bug in the code for determining if a MONDO class should be merged into an existing wikidata item that affected ~500 of the ~20k diseases. I reverted all of those edits (as you just saw). I'm working on fixing them now. Gstupp (talk) 18:35, 31 July 2018 (UTC)

Another MONDO issue

skin disease (Q949302) was recently made a subclass of rare skin disease (Q55788696), but that's clearly in the wrong direction - it looks like another issue with the identifiers, MONDO:0019043 and Orphanet 68346 appear to be for "rare genetic skin disease", not for any generic skin disorder. Some new items need to be created for this perhaps? ArthurPSmith (talk) 14:19, 2 August 2018 (UTC)

  Done Ok, I cleaned these up! Thanks for pointing it out Gstupp (talk) 22:18, 2 August 2018 (UTC)
This edit is still wrong. https://www.wikidata.org/w/index.php?title=Q28757362&diff=720587492&oldid=709267838 --Okkn (talk) 12:26, 4 August 2018 (UTC)

Chromosome values

Thank you, as always, for maintaining so much data. Today I found some value type violations for chromosome (P1057) at Wikidata:Database reports/Constraint violations/P1057. Although I could change the data manually myself, I know that your team periodically updates the data. So, to avoid flip-flopping edits, I am reporting it here.

Thanks! --Was a bee (talk) 08:25, 4 August 2018 (UTC)

More loops

(1) Problem with narcolepsy (Q189561) from this set of edits - possibly we need a separate item for Gélineau disease? and (2) with Emery-Dreifuss muscular dystrophy (Q1335642) from this set of edits - again possibly EDMD2 should have its own item? ArthurPSmith (talk) 14:25, 8 August 2018 (UTC)

Ok, split up. There were issues with the xrefs! Thanks Gstupp (talk) 20:18, 8 August 2018 (UTC)

Duplicates

Hello,

You have created many duplicates, for example Q55015731 for Q19001335. This seems to come from an (un)deprecated DOID[2]? Can you do something to merge them?

A suggestion for the future: there are already so many disease items in Wikidata that the chances of creating a duplicate are good. Maybe it would be better to use the Mix'n'match tool instead of directly creating Wikidata items for new DOIDs?

Ske (talk) 09:30, 17 August 2018 (UTC)

Ske Do you have an idea of how many duplicates there are? Gstupp (talk) 21:15, 23 August 2018 (UTC)
All right, I've merged over 1000 diseases.. See log. Gstupp (talk) 22:37, 4 September 2018 (UTC)

Active ingredient

Hi. As you may know, the actual ingredients of drugs often form salts. For example, the active ingredient of Allegra (Q48828913) should be fexofenadine hydrochloride (Q27255526) [3], although currently the has active ingredient (P3781) value of Allegra (Q48828913) is fexofenadine (Q415122). At this time we don't have a way to link fexofenadine (Q415122) and fexofenadine hydrochloride (Q27255526), cefazolin (Q415739) and cefazolin sodium (Q27106104), etc., so we may have to create new properties and organize their relations. In that case, is it possible for your bot to distinguish a chemical compound from its salts? Do the data sources you are importing from correctly distinguish them? Or do you have a good plan for dealing with this problem? Many thanks, --Okkn (talk) 15:04, 23 August 2018 (UTC)

Okkn, yes, we're aware of this but haven't really implemented a solution. When we imported the products and active ingredients, the decision was made to use the chemical itself without the salt, so that different products with the same active ingredient but different salt forms would all still be linked to the same chemical. In the future, there could be a property "precise active ingredient" (or something) that would link to the specific salt form. This is similar to the way it's done in RxNorm (e.g. Prozac -> "Tradename of" -> Fluoxetine, and "Has precise ingredient" -> Fluoxetine Hydrochloride). We could then also have "Fluoxetine Hydrochloride" -> "Form of" -> "Fluoxetine" (which is how it's done in RxNorm (link)). As many of these compounds have RxNorm CUIs, we could implement a solution like this. As of right now, this is lower priority on our end, but I'd be happy to work with you on proposing some properties and getting this started. Gstupp (talk) 21:14, 23 August 2018 (UTC)
@Gstupp: I'm glad to hear that. A "precise active ingredient" property may work, but to eliminate the problem entirely, we may have to distinguish the umbrella term (drug) from the single concept (unique substance). When we talk about fluoxetine (Q422244) as a drug, that refers not only to the substance whose chemical formula (P274) is "C₁₇H₁₈F₃NO", but also to fluoxetine hydrochloride (Q27280620) ("C₁₇H₁₉ClF₃NO") and other salts. KEGG, for example, has "Fluoxetine (DG00942)" as a "Chemical DGroup" (Chemical structure group?), and both "Fluoxetine (D00326)" and "Fluoxetine hydrochloride (D00823)" are members of it. How about introducing this kind of group concept, and moving some properties such as active ingredient in (P3780) and medical condition treated (P2175) from fluoxetine (Q422244) to this new item? --Okkn (talk) 07:31, 24 August 2018 (UTC)

P797

authority (P797) is a wrong qualifier for this [4] as it's not something related to politics or any executive authority. approved by (P790) or maybe some other properties seem much better here. Wostr (talk) 23:49, 23 October 2018 (UTC)

Hi Wostr, Thanks for pointing this out. I don't think that authority (P797) is necessarily wrong though. The FDA is an agency with executive authority. This qualifier has also been used on EMA approved drugs for the past year link, so I worry about changing them all. What do others think? Gstupp (talk) 18:09, 24 October 2018 (UTC)
What I see from many labels and descriptions, and also from the properties on this property, is that P797 is reserved for politics or organisations (i.e. for governing body (Q5588651)). For uses like this one, approved by (P790), maintained by (P126) etc. are used. P790 seems the most appropriate here. Maybe some French labels in P797, or the broad English label, seem okay at first glance, but in some languages the labels are correct (in relation to governing body (Q5588651)), cannot easily be broadened to match the English label, and are quite nonsensical in cases like this. Also, using such non-standard qualifiers can make re-use of this data more difficult in the future. Best, Wostr (talk) 18:50, 24 October 2018 (UTC)

Neoplasm is not a subclass of benign neoplasm

Many types of neoplasms are incorrectly stated as a subclass of benign neoplasm. See anus neoplasm, for example. Mahdimoqri (talk) 05:10, 18 November 2018 (UTC)

I made an issue here. Thanks Gstupp (talk) 18:48, 19 November 2018 (UTC)

Why mark GO terms as deprecated?

Example. It's the only item with that GO term, so now SPARQL doesn't (by default) find any items with that GO term, breaking my scripts. Any specific reason for deprecating that GO term? If not, please revert this edit, and all other GO term deprecations. --Magnus Manske (talk) 16:26, 6 February 2019 (UTC)

Hi @Magnus Manske: we add the deprecated rank here on WD when the term is marked as "obsolete" by the Gene Ontology consortium. For example, here is the GO page for single organismal cell-cell adhesion (Q14863396): http://amigo.geneontology.org/amigo/term/GO:0016337. We left it as deprecated instead of deleting the WD item so that people have a record that it used to be a valid GO term. This still seems like the best behavior to me, but certainly open to discussion if you feel differently... (What's a bit more confusing to me is why annotations to deprecated GO terms exist -- we will investigate.) Best, Andrew Su (talk) 17:52, 6 February 2019 (UTC)
Thanks, that makes sense! --Magnus Manske (talk) 08:31, 7 February 2019 (UTC)
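As a minimal sketch of why a deprecated statement drops out of a default SPARQL lookup: the `wdt:` prefix only matches "truthy" (best-rank, non-deprecated) statements, whereas the `p:`/`ps:` path reaches statements of every rank. The helper name below is hypothetical, and the use of P686 as the GO ID property is an assumption not stated in this thread.

```python
from typing import Dict


def go_term_queries(go_id: str) -> Dict[str, str]:
    """Build two SPARQL queries that look up items by GO ID.

    Assumption: P686 is the Gene Ontology ID property.
    """
    # wdt: matches only truthy statements, so deprecated ones are skipped.
    truthy = f'SELECT ?item WHERE {{ ?item wdt:P686 "{go_id}" . }}'
    # p:/ps: walks the full statement node, so every rank is returned.
    all_ranks = (
        "SELECT ?item ?rank WHERE { "
        f'?item p:P686 ?st . ?st ps:P686 "{go_id}" ; wikibase:rank ?rank . '
        "}"
    )
    return {"truthy": truthy, "all_ranks": all_ranks}


queries = go_term_queries("GO:0016337")
print(queries["all_ranks"])
```

A script that needs to keep finding obsoleted GO terms could use the second form and inspect `?rank` itself.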

Accidental GO term removals

I'm afraid my bot didn't check properly before removing some GO terms (example) that you had added. I have changed the code to not remove any GO terms that have a curator (P1640) that is not GeneDB (Q5531047), but some damage was done. I am trying to have the changes for that species reverted, but there may be others. Feel free to add them again, my bot should respect them next time! --Magnus Manske (talk) 15:10, 1 March 2019 (UTC)

I think those would be automatically added back in on the next bot run, but we'll keep an eye on it and confirm... Best, Andrew Su (talk) 17:54, 1 March 2019 (UTC)

Once again, a bug in my bot code has caused the removal of your GO terms from some items with GeneDB ID (P3382), for found in taxon (P703):Plasmodium falciparum 3D7 (Q61779043) (example edit). The bug is fixed now, but it is probably best to wait for your bot to re-add them. --Magnus Manske (talk) 11:54, 27 June 2019 (UTC)

Possibly related account

Is Torogertu related to this account? Some edits (e.g. Special:Diff/874634275/915402367) indicate to me that perhaps the user account is being misused, but I don't really know what to make of it. Jc86035 (talk) 12:20, 16 April 2019 (UTC)

Hi Jc86035, I'm working on the disease ontology bot and I may have accidentally let it run wild. I'll clean it up. Torogertu (talk) 15:25, 16 April 2019 (UTC)
... and just add a tiny bit more detail, yes, Torogertu is a new member of the team running this ProteinBoxBot account. He was doing some test edits on his user account to prototype an enhancement, but then accidentally forgot to set the 'test' flag that would have prevented the actual write. Thank you for the heads up, and again, we'll work on fixing things asap... Best, Andrew Su (talk) 15:53, 16 April 2019 (UTC)
Jc86035, thanks again for catching this. I went back to the code and realized I git-pulled the master bot vs the branch bot where I made all my edits (including the 'test' flag). I double checked that I made no edits to the master bot prior to the test-run, and found no changes. I believe that if PBB were to be run again, it would write the same things that I had written on my user account. I confirmed this through looking through a couple dozen of my edits. As all the edits I observed would probably be written by PBB in the future, I don't plan on removing the edits. I'll be more careful next time. Torogertu (talk) 04:54, 17 April 2019 (UTC)

another subclass loop

This edit created a subclass loop between cerebellar ataxia (Q154709) and Q21082497. I'm assuming there's a problem with one or more of the identifiers, or is this loop actually in your source reference? ArthurPSmith (talk) 16:08, 20 April 2019 (UTC)

@ArthurPSmith: Hmm, interesting example. It appears that the subclass loop is due to the fact that two data sources Disease Ontology release 2019-04-18 (Q63226230) and Monarch Disease Ontology release 2018-06-29sonu (Q55345445) disagree on the direction of that subclass relationship. I don't have the expertise to judge which is correct, and that's certainly not a call that I'd want our bot to make automatically. Given that Wikidata is a database of assertions and not facts, it seems like we want to allow for capturing this type of disagreement. Of course, I understand that this complicates usage of Wikidata by reasoners. Do you have a suggestion on how this could be better modeled? Or is the reference something that reasoners just need to account for? Best, Andrew Su (talk) 22:46, 22 April 2019 (UTC)
If it's really in the sources that's probably fine - but we may want to notify them about the disagreement. Reasoners will have to deal with stuff like that I guess! ArthurPSmith (talk) 11:33, 23 April 2019 (UTC)

RfC about enzymes

There's currently an RfC about bot created items for enzymes: https://www.wikidata.org/wiki/Wikidata:Requests_for_comment/Unifying_GO_activities_and_enzyme_articles ChristianKl❫ 07:49, 19 June 2019 (UTC)

duplicated and misplaced GOA determination methods

The bot added GOA statements with ref having the determination method. The method is duplicated from the statement qualifier (where it belongs and so it's marked as scope violation). Example: https://www.wikidata.org/w/index.php?title=Q27757881&oldid=988373683 . Will you fix these, i.e. remove the doubled method statement everywhere? --SCIdude (talk) 08:31, 29 July 2019 (UTC)

Or better, comment on the proposal at Wikidata_talk:WikiProject_Molecular_biology#"determination_method"_property_on_GOA_references --SCIdude (talk) 13:45, 29 July 2019 (UTC)

None

This edit inserted the word "none" in the alias field. Can you adjust the bot to interpret that as something not worth adding? WhatamIdoing (talk) 03:12, 2 August 2019 (UTC)

@WhatamIdoing: thank you for the note. Yes, that clearly would be a good check to add. We will add that in a future release: https://github.com/SuLab/GeneWikiCentral/issues/119. Best, Andrew Su (talk) 03:17, 2 August 2019 (UTC)
Thanks.
You all are great to work with. You're very quick to respond and obviously interested in having a friendly bot in addition to it being useful. Thanks for being you, as well as for fixing this minor problem. WhatamIdoing (talk) 15:44, 2 August 2019 (UTC)
@WhatamIdoing: and thank you for continually providing constructive feedback and advice. We are grateful to all who help us identify bugs both big and small. Team effort! Best, Andrew Su (talk) 18:16, 2 August 2019 (UTC)
Ping. :-) WhatamIdoing (talk) 14:24, 22 August 2019 (UTC)
Thank you for the ping! Sorry, summer holidays (as well as fixing a couple other issues with our bot automation infrastructure) have kept us from implementing a solution here. But we definitely have not forgotten! More soon... Best, Andrew Su (talk) 17:08, 22 August 2019 (UTC)
Sorry for letting this slide a bit. I have just implemented that proposed solution to ignore words not worth adding. That list currently contains 4 words, being none, None, gene and Gene. Other suggestions are welcome. --Andrawaag (talk) 22:13, 22 August 2019 (UTC)
That sounds like a great starting point. I wondered whether "unknown" might also be a good option, but I didn't find any instances of that in my search (so it would probably be pointless). WhatamIdoing (talk) 17:12, 23 August 2019 (UTC)
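The ignore list described above could look roughly like the following sketch. It is not the bot's actual code: the function name is hypothetical, and comparing case-insensitively is an assumption that is slightly broader than the four listed words (it would also drop "NONE" or "GENE").

```python
# Words mentioned in the thread (none, None, gene, Gene), stored lowercased.
IGNORED_ALIASES = {"none", "gene"}


def clean_aliases(aliases):
    """Return the aliases worth writing, dropping placeholders and blanks."""
    cleaned = []
    for alias in aliases:
        text = alias.strip()
        # Skip empty strings and placeholder words from the ignore list.
        if not text or text.lower() in IGNORED_ALIASES:
            continue
        cleaned.append(text)
    return cleaned


print(clean_aliases(["None", "INS", "gene", " insulin "]))
```

Adding a further word such as "unknown" would then be a one-line change to the set.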

Stop specific UniProt statement imports

Please STOP import/update of PDB and GOA statements from UniProt entries with keyword "Cleavage on pair of basic residues [KW-0165]", see Wikidata_talk:WikiProject_Molecular_biology#Problems_with_PDB_and_GOA_from_UniProt_imports. --SCIdude (talk) 09:44, 6 August 2019 (UTC)

As the mentioned keyword seems weakly supported, I found a better property to look for: the existence of "peptide" in the PTM part, the query is annotation:(type:peptide) taxonomy:"Homo sapiens (Human) [9606]" (162). NOTE also that peptides get a unique ID, see https://www.uniprot.org/help/sequence_annotation at the bottom, e.g. https://www.uniprot.org/uniprot/P01019#PRO_0000032458 for angiotensin-2. --SCIdude (talk) 15:44, 7 August 2019 (UTC)

unlocated Entrez genes

There are gene entries from Entrez where only the chromosome position is known, and so no official gene name can be given. It seems Entrez then just took symbols from OMIM, regardless if that was a gene or phenotype entry, and gave that as symbol. Example: TEC (Q26241247) from Entrez 100124696 where the symbol is from OMIM 227050 (Transient erythroblastopenia of childhood) which symbol collides with TEC (Q18031939).

Such items should be marked somehow, maybe genomic start/end ---> unknown --SCIdude (talk) 07:36, 15 August 2019 (UTC)

Interesting example... I would have thought that Entrez would not have used the same symbol for two genes in the same species. But anyway, my guess is that this case will be pretty rare. And given that the two gene items have different Q-ids (and in turn different Entrez IDs) I don't think there's anything incorrect here. Do you agree? Regarding the suggestion on adding "unknown" for genomic start/end, I wonder if that's necessary and desirable -- we don't proactively note when protein domains or protein interactions are not known, for example... Best, Andrew Su (talk) 20:53, 21 August 2019 (UTC)

Wrong ICD9 property/URL used

In https://www.wikidata.org/w/index.php?title=Q7170410&diff=566087319&oldid=549930250 and many others (diseases) you added ICD9-CM (having removed pure ICD9 earlier). Now all ICD9s on diseases are ICD9-CM, which point nowhere because their search engine can't find procedures with that code. Please fix! --SCIdude (talk) 08:39, 21 August 2019 (UTC)

@SCIdude: Thanks for the note. I've spent a little bit of time looking into this and just want to jot down my observations. First, the diff you linked above is pretty old (2017) so the bot has undoubtedly changed/improved since then. Links to more recent diffs are always helpful. Second, it looks like the edit is correct in that the Disease Ontology says that the ICD9-CM for persistent fetal circulation syndrome (Q7170410) is 747.83 (ref: https://www.ebi.ac.uk/ols/ontologies/doid/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FDOID_13042). Third, I see that the ICD9-CM link from Wikidata (http://icd9cm.chrisendres.com/index.php?srchtype=procs&srchtext=747.83&Submit=Search&action=search) returns no matching record, but this similar URL (http://icd9cm.chrisendres.com/index.php?action=search&srchtext=747.83) appears to return the right record. Perhaps the formatter URL should be updated? Any thoughts here? Best, Andrew Su (talk) 20:42, 21 August 2019 (UTC)
@Andrew Su: Agree, it's not the wrong type of code but the URL format. If changing the formatter suffices for that then please do. Could you please also change the URL format for HGNC gene symbol (P353) which at the moment results in 404 as well: example INS gives https://www.genenames.org/data/hgnc_data.php?match=INS (my personal opinion is that the HGNC ID already gives the exact match so I think the whole property is irrelevant). --SCIdude (talk) 06:13, 22 August 2019 (UTC)
Great, I updated both formatters. Similar to your thought, I think the formatter URL for HGNC gene symbol (P353) still isn't perfect/useful (though I do think having the property/statement itself is useful for searching). Best, Andrew Su (talk) 18:11, 22 August 2019 (UTC)
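For readers unfamiliar with formatter URLs, here is a small sketch of how a link is derived from an identifier value: the `$1` placeholder in the formatter is replaced with the value. The two formatter strings below are reconstructed from the example URLs quoted above and are assumptions, as is the helper name.

```python
from urllib.parse import quote

# Assumed formatters, derived from the two URLs quoted in this thread.
OLD_FORMATTER = (
    "http://icd9cm.chrisendres.com/index.php"
    "?srchtype=procs&srchtext=$1&Submit=Search&action=search"
)
NEW_FORMATTER = "http://icd9cm.chrisendres.com/index.php?action=search&srchtext=$1"


def expand_formatter(formatter_url, value):
    """Substitute an identifier value for the $1 placeholder."""
    # Percent-encode the value so it is safe inside a query string.
    return formatter_url.replace("$1", quote(value, safe=""))


print(expand_formatter(NEW_FORMATTER, "747.83"))
```

The old formatter restricts the search to procedures (`srchtype=procs`), which would explain why a disease code such as 747.83 returned no record.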

MeSH descriptor ID (P486) and Disease Ontology

Please note that Disease Ontology is not a reliable source for updates of MeSH descriptor ID (P486). I'll explain shortly why that may be, but the problem seems endemic at DO.

The number of database constraint violations logged at Wikidata:Database reports/Constraint violations/P486#Unique value is large, in the thousands. The "distinct value" constraint violations interfere with the work I do with the NCBI2wikidata tool (see WD:SS), which looks up topics here by their P486 value. So I have been working to remove those issues; the August 7 run by ProteinBoxBot based on the August 6 DO update wrote over some of that work.

While no doubt some of the problems are caused randomly by humans, there is a major and systematic series of errors being caused by bots. I found that the MeSH identifier that properly belongs on meningioma (Q369157) was on 17 (seventeen) items. You can see why that might come about on https://www.ncbi.nlm.nih.gov/mesh/68008579. There is a long list called "Entry Terms". These are not synonyms for "meningioma", in nearly all cases. I think they work in the MeSH system roughly like redirects; aliases for them perhaps, but certainly not for Wikidata. Effectively they run over subclasses of meningiomas.

The use of external databases to reference identifications of MeSH IDs with items here is unconvincing, in the cases I have looked into. DO seems the worst, but they perhaps all have the same problem, not discriminating well for the subclasses. DO may have a problem discriminating "neoplasm" from "cancer", for example, which is serious considering that MeSH for oncology pivots on neoplasms for its broad terms in this area.

As illustration: right now the MeSH ID D020528 is on primary progressive multiple sclerosis (Q18553470), progressive relapsing multiple sclerosis (Q18553471), secondary progressive multiple sclerosis (Q18965511) and chronic progressive multiple sclerosis (Q18971609). The MeSH page https://www.ncbi.nlm.nih.gov/mesh/?term=D020528 is for "Multiple Sclerosis, Chronic Progressive", and so the ID belongs on chronic progressive multiple sclerosis (Q18971609). It is hard not to attribute the duplications to ill-informed use of the entry terms there: "close match" is going to cause a database violation, so should be ruled out.

I removed the ID from primary progressive multiple sclerosis (Q18553470) with this diff on August 7. It was replaced a couple of hours later with this diff by ProteinBoxBot. I only noticed this because I have begun some systematic record keeping and now track error messages in NCBI2wikidata.

I honestly think the bot edits "referencing" MeSH identifications add little value. As you can see on chronic progressive multiple sclerosis (Q18971609), I'm adding the MeSH string itself with the MeSH descriptor ID (P486) statement, in the hope that the kinds of errors that have proliferated will be made very much more obvious in future, to humans, just by comparing with the item label.

My wish is to remove the thousands of database constraint violations for MeSH descriptor ID (P486), since they are affecting progress in getting main subject (P921) statements from a PubMed API. If the MeSH updates from ProteinBoxBot were simply halted for the present, that would be a great help.

Charles Matthews (talk) 18:28, 21 August 2019 (UTC)

OMIM phenotype IDs are on subclasses as well. The correct solution would be to have one "exact match" to indicate just that, the rest are implicitly inexact. --SCIdude (talk) 20:35, 21 August 2019 (UTC)
@Andrew Su: what would be wrong with ONLY having exact matches on an item, except when there are no exact matches at all? --SCIdude (talk) 07:03, 22 August 2019 (UTC)
I think only including exact matches would be a great solution, but I don't believe that the Disease Ontology differentiates exact matches from less precise matches (unfortunately, generic "xrefs" are commonly used in biomedical ontologies). I've pinged User:Lschriml to ask her to chime in here. Note also that we tried to set up a method to indicate exact/narrow/broad match (as demonstrated here https://www.wikidata.org/wiki/Q147778#P492) when/if the data sources specify... Best, Andrew Su (talk) 18:19, 22 August 2019 (UTC)
User:Lschriml - for Disease Ontology disease term cross references to other clinical vocabularies such as MeSH we strive to specifically include only exact matches. However, the DO and the other clinical vocabularies do not always split or clump disease terms in the same way, therefore, there are instances where a DO disease term would map to a more generic term in MeSH. The links we annotate in the DO file are assessed, so that we can provide the best match between vocabularies.

Thanks for the responses. I think, in the big picture, there is aggregation of data, and there is curation. The latter needs to be by hand, and for the MeSH descriptor ID I believe the point has been reached where what ProteinBoxBot is doing is, at best, showing diminishing returns.

I'm now engaged in a drive to remove the MeSH descriptor ID (P486) constraint violations, which is necessary drudgery. It is not possible for me to undo a ProteinBoxBot update edit partially, since it is added all-or-nothing. The curation logic would be to drop the P486 editing from the runs. I would say that this approach is just a reflection of the maturity of Wikidata in this area: it is now within reach to have the whole set of the MeSH D-numbers matched 1-to-1 into Wikidata, which opens up downstream uses, such as the one I'm engaged in.

That hiatus would not be the end of the story, since MeSH has annual updates, and also retires some of its codes as obsolete. But it is what should be considered at this point, in my view. Charles Matthews (talk) 10:31, 28 August 2019 (UTC)

The ProteinBoxBot run of today has written over my maintenance work of yesterday in places, so I have undone some edits. Charles Matthews (talk) 12:49, 28 August 2019 (UTC)

@Charles Matthews: Thanks for the follow up. A few points from my perspective. First, the ProteinBoxBot will continue to overwrite your changes on MeSH descriptor ID (P486). In general it leaves human edits alone, but not for things we consider core IDs. So I suggest holding off further manual edits until we get the bot behavior squared away, which I think we can do quickly. Second, I definitely see how the behavior and examples above are less than ideal. In general I subscribe to the thought that Wikidata is a collection of statements without judgement on the validity of those statements (see for example https://www.wikidata.org/wiki/Q2#P1419). However, in this case with these identifiers I can see how they might disrupt your downstream work with WD:SS. (Personally, I view this as a data issue and not a bot issue, but this is a quibble...) Third, I'm trying to think up the ideal long-term situation here, especially for the cases that Lschriml describes where there is no exact match between vocabularies due to different splitting/clumping criteria. If we were able to get the DO team to annotate those mappings with explicit qualifiers for narrow match (Q39893967), broad match (Q39894595), and exact match (Q39893449) similar to how it's shown at https://www.wikidata.org/wiki/Q147778#P492 would you be able to effectively filter for the exact matches in your application? Fourth, I think your proposed solution of having ProteinBoxBot ignore MeSH descriptor ID (P486) at least for the short term is reasonable -- let me just check with the rest of the team just to make sure there aren't any other impacts I'm missing. EDIT: Fifth, another thought just occurred to me. Is it possible for you to filter based on the reference, to essentially ignore mappings that come from the Disease Ontology? That is another possible solution that follows my point #2, where the bot is accurately reporting a statement from a source... 
Not saying this is the best course, but just want to discuss the range of possibilities. Best, Andrew Su (talk) 13:24, 28 August 2019 (UTC)

(edit conflict) @Andrew Su: No, I don't think I can just "hold off": I don't see why the project I'm involved in, to populate items about scientific papers with main subject (P921), should not be given precedence.

Please see at Wikidata:Bots#Bot requirements "Monitor constraint violation reports for possible errors generated or propagated by your bot". I asked for a halt, having given details of the issue. Charles Matthews (talk) 13:46, 28 August 2019 (UTC)

@Charles Matthews: Sorry, I think my main message may have gotten lost in my last lengthy reply -- the main message being that I think our efforts are highly aligned, and that I am mostly agreeing that ProteinBoxBot should defer to you for MeSH descriptor ID (P486). After consulting our internal team, I don't think there are any issues on our end, so we will go ahead and make that change.
I think that's a reasonable short-term solution, but I still would like to discuss what the best long-term solution should be. (For example, as I think you allude to, there is value in a bot keeping up with updates over time.) Can you comment on two ideas I mentioned above? In my third point above, I asked whether your processes would be able to use qualifiers that distinguished exact from non-exact matches (if they were widely used, which they currently aren't). And in my fifth point above, I similarly asked whether you could filter statements based on the reference. Can you share your thoughts please? Best, Andrew Su (talk) 15:43, 29 August 2019 (UTC)

@Andrew Su: OK, then. In the most general terms, I subscribe to m:Knowledge Integrity, which is a WMF program involving Wikidata. That would be in tension with "the thought that Wikidata is a collection of statements without judgement on the validity of those statements", given the aspiration there on meta as "Our 5-year vision for the Knowledge Integrity program is to establish Wikimedia as the hub of a federated, trusted knowledge ecosystem".

You may be vaguely familiar with ScienceSource, which began in 2018 as a ContentMine project not long after the Knowledge Integrity Program was launched. The automation of en:WP:MEDRS via metadata held (mostly) here about biomedical articles is the headline offer in the ScienceSource project: in any case it is not hard to see the relationship between reliable medical sources and the aspiration quoted.

Now this is all a bit of a way from the topic of my post to this page, it may appear. In its first year, with a WMF grant, ScienceSource did proof of concept of some things, in particular adding main subject (P921) statements to items here about biomedical articles. Those statements are about diseases, so you will understand the nature of the problem, and also that "neglected diseases" are treated somewhat differently by MEDRS. In any case since ScienceSource also involves text-mining, as is appropriate for a ContentMine project, you will probably also understand that text-mining neurology and oncology differently also makes sense.

So here's where MeSH comes in. The topical information is drawn into Wikidata from an API on PubMed: a batch of search terms is fed into the API, and our NCBI2wikidata tool transforms the results into bot code, a buffering necessary really because they arrive much too fast to be posted here in real time. Since the PubMed API knows nothing about the Q-numbers of topics, the subtle point in this tool is the lookup from the MeSH ID D014435 that gives you typhoid fever (Q83319).

I'm not the tool's author, I should make clear. But here's the thing: if there are two or more items on Wikidata with MeSH ID D014435 on them, the code will make an arbitrary decision, and what results as a main subject will be wrong at least as often as it is right, on average. In most of the incorrect cases that happen, it is a "fuzzy" match to the real thing, and that does not undermine the applications made in the first 12 months of the project.
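To illustrate the lookup problem, here is a rough sketch (not NCBI2wikidata's actual code, which is not shown in this thread) of building a MeSH-to-item index that refuses to guess when an ID sits on more than one item. The function name is hypothetical; the sample pairs are the ones mentioned on this page.

```python
from collections import defaultdict


def build_mesh_lookup(pairs):
    """Split (qid, mesh_id) pairs into unambiguous and ambiguous mappings."""
    by_mesh = defaultdict(set)
    for qid, mesh_id in pairs:
        by_mesh[mesh_id].add(qid)
    # IDs found on exactly one item can be resolved safely.
    unique = {m: next(iter(q)) for m, q in by_mesh.items() if len(q) == 1}
    # IDs found on several items violate the distinct-value constraint.
    ambiguous = {m: sorted(q) for m, q in by_mesh.items() if len(q) > 1}
    return unique, ambiguous


pairs = [
    ("Q83319", "D014435"),
    ("Q18553470", "D020528"),
    ("Q18553471", "D020528"),
    ("Q18965511", "D020528"),
    ("Q18971609", "D020528"),
]
unique, ambiguous = build_mesh_lookup(pairs)
print(unique)
print(ambiguous)
```

Reporting the ambiguous IDs, instead of picking one item arbitrarily, would turn each constraint violation into a visible work item.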

But taking "knowledge integrity" as an ethic, we end up with unsatisfactory incorrect statements saying something like "PubMed states that this paper has as topic incontinentia pigmenti achromians" when that should be "pigmentation disorders", to take an example I picked up on today. Clearly this approach can and should be scaled up – there are now 22 million items here on articles, a very high proportion of those being biomedical – and should be applied to the whole range of topics, taxons and substances and genes and so on, not just diseases. There are 26K MeSH terms.

With this long preamble, here is a summary of your previous comments and my reactions:

  1. "Wikidata is a collection of statements without judgement on the validity of those statements". To the extent that I'm proposing to take MeSH topics on the PubMed abstract pages (say the main, starred ones without the /qualifiers, to be reasonable) as authoritative, I'm agreeing with that approach.
  2. "I view this as a data issue." Agreed, but in the good case that the 26K MeSH descriptor ID (P486) statements are at some future point correctly assigned to different items here, with matches acceptably exact, there are still the curation issues of reverting unwanted changes (could be done with a watchlist) and patrolling for unwanted duplications (violation of the "distinct values" constraint). Plus periodic, planned updates. But those can happen at times when NCBI2wikidata or successor tools can work round them. I do think https://id.nlm.nih.gov/mesh/query is the authoritative source in this area.
  3. "...qualifiers that distinguished exact from non-exact matches." In the context of statements using a property that has a "distinct value" constraint, close matching is acceptable if, and only if, there is no exact match present for the value in question. This can be picked up by a query.
  4. "...having ProteinBoxBot ignore MeSH descriptor ID (P486) at least for the short term is reasonable". Thank you for your understanding on this matter.
  5. "Is it possible for you to filter based on the reference, to essentially ignore mappings that come from the Disease Ontology?" Not really. An entry such as http://disease-ontology.org/term/DOID%3A461/ is just seriously wrong. Read at https://meshb.nlm.nih.gov/record/ui?ui=D009379 where it says "Neoplasms composed of muscle tissue: skeletal, cardiac, or smooth. The concept does not refer to neoplasms located in muscles", and at https://meshb.nlm.nih.gov/record/ui?ui=D019042, "do not confuse with NEOPLASMS, MUSCLE TISSUE". Well, they confused those things. I think you should be concerned, because, as I pointed out before, the bot operator is required to "Monitor constraint violation reports for possible errors generated or propagated by your bot". My underlining. This is the issue that brought me here, and it couldn't be clearer where the onus lies. The bot operator is required to make judgements as in #1, and the buck stops there.

I hope that helps.

Charles Matthews (talk) 17:37, 29 August 2019 (UTC)
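The rule in point 3 above (a close match is acceptable if, and only if, no exact match exists for the same value) can be sketched as follows. This assumes every statement carries a match-type label, which, as noted earlier in the thread, is not yet widely the case; the function name and the "close" label on Q18553470 are purely illustrative.

```python
def usable_items(statements):
    """Pick the items to use per MeSH ID.

    statements: iterable of (qid, mesh_id, match_type), where match_type
    is "exact" or "close" (hypothetical labels for this sketch).
    """
    by_mesh = {}
    for qid, mesh_id, match_type in statements:
        groups = by_mesh.setdefault(mesh_id, {"exact": [], "close": []})
        groups[match_type].append(qid)
    # Prefer exact matches; fall back to close ones only when none exist.
    return {
        mesh_id: groups["exact"] if groups["exact"] else groups["close"]
        for mesh_id, groups in by_mesh.items()
    }


statements = [
    ("Q18971609", "D020528", "exact"),
    ("Q18553470", "D020528", "close"),
    ("Q83319", "D014435", "close"),
]
print(usable_items(statements))
```

With such labels in place, a downstream tool could keep close matches on items without tripping over them during lookup.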

@Charles Matthews: We have halted our bot for now, to clarify and do some updates. However, we need to state that this does not imply that we will not resume with the previous practice. I disagree with the assessment that our bots "generated or propagated errors". We are dealing with a complex issue here that boils down to a disagreement between resources. As Andrew already stated, we too are uncomfortable with the excessive use of xrefs to map relations. We too would prefer if primary resources would be more explicit about the nature of the relationship. However, as a cross-reference, they indicate similarity between resources. So yes, if multiple DO terms state a cross-reference to MeSH, it makes sense to set a Wikidata item reflecting that disease with the appropriate MeSH IDs, even if that means triggering some constraint violations. In this respect, it is maybe worth noting a similar issue with gene identifiers. Both Ensembl and NCBI gene are authoritative resources on gene identifiers. Sourcing both to Wikidata will lead to a similar (but smaller in size) issue as the one you are raising here. It is not our position (nor yours) to assume the preference of one disease resource over another.
As with the issue with the Gene ID mappings, it is not always possible to respect constraint violation reports. As long as primary resources prefer to state mappings with a simple cross-reference, we have to introduce many one-to-many relationships.
As said I have halted the DO bot for now. I am currently creating EntitySchemas in the new schema extension of Wikidata, which is a better way to deal with these complex mappings. With constraint violations, it is not possible to say that a constraint violation is no longer a violation if the statement is sourced from different primary sources - which by design is possible in Wikidata. This means that to fetch real constraint violations we have to rely on SPARQL, which due to the complexity of the query needed, often times out. With the new schema extension, this becomes easier to deal with. So once the schemas are ready I will share them with you, so that when we resume our bots, each downstream application can use those to pick its preferred resources. --Andrawaag (talk) 17:17, 9 September 2019 (UTC)

Well, I have to say I don't fully understand what you are saying there, but what is in Wikidata should be fit for downstream use. I have given a few case studies of the issues above. It would be easy to give more: these are just things that turned up as I was going about removing the constraint violations.

The following is what should happen now: (i) the constraint violations, at least for the D-numbers, should be completely fixed, except for a few cases I have noted at Property_talk:P486#Allowable constraint violations. (ii) Any future bot runs should be examined for any constraint violations introduced. This is in order to allow proper feedback to be given to the upstream database. I think this is normal: if there appear to be problems with the database content, they should be raised. (iii) Since the constraints are in effect imposed at the time of the property creation, any serious discussions in principle of what to do about them should be raised with the community, since property creation is a community matter. (iv) I am not involved in the bot approval process here, but I have cited a principle from the bot approval page. If that principle, which seems to me to be clear enough, is to be debated, then it should be debated in the appropriate forum.

I do find the technical side rather overdone here, since MeSH terms are defined in most cases by clearly-drafted scope notes. The main issue in resolving violations of the "distinct value" type is which of two or more items is correctly matched to the MeSH term. In quite a number of cases I have met, the answer is "neither", and the appropriate action is to create a new item. As I said above, there is the chance of matching MeSH D-number terms 1-1 into Wikidata, and that is my objective at this point. The disease mix'n'match catalog for MeSH was actually matched about a year ago, by a group of people that included me working as a Wikimedian in Residence. That work needs checking, but I find it surprising that the DO work apparently has made no allowance for it.

I'm actually hoping to have made substantial further progress with MeSH completion by the time of WikidataCon. It is not the disease part that needs most attention, as I say, just to stabilise in particular the oncology items where a great deal of confusion had been introduced. See the comment on the property talk page from 2018 about that. Charles Matthews (talk) 20:13, 9 September 2019 (UTC)

ncRNA items marked as gene

The Entrez docs at https://www.ncbi.nlm.nih.gov/books/NBK3841/#EntrezGene.Properties specifically differentiate between gene and RNA entries: entries with "gene type"--->"ncRNA" are RNA entries. Nevertheless your import of them added "subclass of"--->"gene", example diff. I know these are old, and the docs are not easily found. Granted the Entrez ID is actually about the gene, but the item is not. Please make sure this confusion won't happen again: there should be two items created with "encodes/encoded by" set, of course. --SCIdude (talk) 09:30, 29 August 2019 (UTC)

@SCIdude: Sorry for the slow responses -- just getting caught up here again. As you noted, this was an old edit, and after reviewing this issue, we *think* that this is not an issue with the current version of the ProteinBoxBot. (and ultimately, it may tie into a more fundamental data modeling issue of how we represent genes and RNA transcripts.) But thanks for the note, and please let us know if you notice any further issues... Best, Andrew Su (talk) 17:55, 6 September 2019 (UTC)

what to do with UniProt obsoletions

There are 27,205 right now, and I'm marking them as inst-of protein obsoleted in UniProtKB, SwissProt, TrEMBL (Q66826848). Later I'll add the reference to the link and remove the UniProt protein ID (P352) but do the items have any justification for their existence, maybe as failed hypothesis? Do the statements that refer to the UniProt have any justification? --SCIdude (talk) 09:37, 1 September 2019 (UTC)

@SCIdude: Thanks for mentioning this. I have opened a ticket to follow progress. It is a good question what to do with these obsolete records. My initial response is close to identical to yours, i.e. to remove the affected statements (UniProt protein ID (P352)). If they are not being used, meaning that there are no links to or from other Wikidata items, I think that is what we should do. I will create a bot to do just that, removing P352. If the UniProt ID is the only identifier of the item, I would argue for nominating the item for deletion. Without any identifier (or Wikimedia sitelink), removing UniProt protein ID (P352) basically makes the item an orphan. Wikidata is a bit slow to unresponsive now, where I am, due to an apparent DDoS attack, which makes it a bit difficult to investigate these cases. But I will follow up shortly. The good thing is that the obsolete records are available through UniProt's SPARQL endpoint, which makes it easier to fix this automatically. --Andrawaag (talk) 20:35, 6 September 2019 (UTC)
That leaves >22k items without UniProt IDs, visible with this query:
 SELECT ?p WHERE {
   ?p wdt:P31 wd:Q66826848 .
   MINUS { ?p wdt:P352 [] }
 }

Around 12,800 of them still have RefSeq Protein IDs but AFAIK RefSeq completely refers to UniProt, right? --SCIdude (talk) 06:12, 7 September 2019 (UTC)
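
For anyone who wants to re-run the check above from a script, here is a minimal sketch, assuming the public Wikidata Query Service endpoint and its JSON result format. The function names and the User-Agent string are made up for illustration and are not part of any existing bot.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata Query Service endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(limit=None):
    """Items marked obsolete (P31 = Q66826848) that no longer carry P352."""
    query = (
        "SELECT ?p WHERE {\n"
        "  ?p wdt:P31 wd:Q66826848 .\n"
        "  MINUS { ?p wdt:P352 [] }\n"
        "}"
    )
    if limit is not None:
        query += f"\nLIMIT {int(limit)}"
    return query


def extract_qids(result):
    """Pull bare QIDs out of a SPARQL JSON result document."""
    bindings = result["results"]["bindings"]
    return [b["p"]["value"].rsplit("/", 1)[-1] for b in bindings]


def run_query(query, user_agent="obsolete-uniprot-check/0.1 (hypothetical)"):
    """Send the query to WDQS and return the decoded JSON (needs network)."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        WDQS_ENDPOINT + "?" + params, headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `extract_qids(run_query(build_query(limit=100)))` would return the first 100 matching QIDs; the limit is only there to keep a test run small.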

exact xrefs from GO

@Andrew Su: @Andrawaag: As to mapping relation types I have made a request to the GO folks at https://github.com/geneontology/go-ontology/issues/17892. Maybe that would help you too? --SCIdude (talk) 07:01, 24 September 2019 (UTC)

Thanks! Subscribed to see where that thread goes... The GO team's reaction will be a good gauge, though arguably we'd also want to get feedback from the broader biomedical ontology community, perhaps at https://github.com/OBOFoundry/OBOFoundry.github.io/issues. Best, Andrew Su (talk) 04:44, 25 September 2019 (UTC)

syncing of GO annotations

Is it possible the bot does not remove obsolete GO statements? On BCL2 related protein A1 (Q21100281) I just deprecated "channel activity" which, according to UniProt history, was already removed at the end of 2018. --SCIdude (talk) 14:41, 5 October 2019 (UTC)

Another example is Solute carrier family 66 member 1 like (Q21102153) where UniProt removed all GO annotations in March 2017, but we still have them. --SCIdude (talk) 13:24, 6 October 2019 (UTC)

Good questions, and thanks for the examples. Let us check on this. As you can imagine, deleting content is something we are *very* cautious about with our bots for fear of overwriting human edits, and conclusively figuring out whether a bot or a human last touched a statement is not trivial. Anyway, more soon... Best, Andrew Su (talk) 18:57, 9 October 2019 (UTC)
In general yes. But does not GO have the authority over any GO annotations? Maybe this should be specified in the three properties (function/process/component). --SCIdude (talk) 09:00, 10 October 2019 (UTC)
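
To make the trade-off in this thread concrete, here is a rough sketch of one cautious sync rule: only flag a statement when it is absent from the current source, its reference cites that source, and a bot was the last editor. This is not how ProteinBoxBot works; the `Annotation` record, the `last_editor_is_bot` flag and the GO identifiers in the usage note are placeholders, and working out that flag is exactly the non-trivial part mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    go_id: str                # GO term the statement points to
    stated_in: str            # source named in the statement's reference
    last_editor_is_bot: bool  # assumed to be known; hard to establish in practice


def stale_annotations(on_wikidata, current_in_source, source="UniProt"):
    """Return statements that look safe to deprecate.

    A statement qualifies only if the source no longer lists it, its
    reference cites that source, and no human touched it last.
    """
    current = set(current_in_source)
    return [
        a
        for a in on_wikidata
        if a.go_id not in current
        and a.stated_in == source
        and a.last_editor_is_bot
    ]
```

With statements for the placeholder terms `GO:A` and `GO:B`, both cited to UniProt and last edited by a bot, and a source that now lists only `GO:A`, the function returns just the `GO:B` statement. A human-edited statement, or one cited to another source, is left alone.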

Genetics Home Reference (GHR) Conditions URLs

The Genetics Home Reference is a good source of information for gene and disease information. I started a property proposal in order to address the addition of Genetics Home Reference (GHR) Conditions URLs to Wikidata. If you have suggestions on how to improve the property, please chime in. Gtsulab (talk) 18:09, 16 October 2019 (UTC)

STOP adding ChEBI substance ids to GO items

@Andrew Su: @Andrawaag: Please stop adding ChEBI ids to GO items. Not only are they misplaced, they create duplicate conflicts. It's simple to instead link the substance item the ChEBI is referring to---I added the missing 2k items just last week, and I'll start linking to them from GO items tomorrow. Just STOP adding ChEBI ids.

Example: https://www.wikidata.org/w/index.php?title=Q21469633&diff=1050531538&oldid=1038745167 --SCIdude (talk) 14:51, 15 November 2019 (UTC)

@SCIdude: We have stopped adding ChEBI ids. We have also deactivated the bot until we have found a better solution. Unfortunately, the issue is not limited to ChEBI alone. For mappings we have been relying on the property dbXREF, which many sources use to capture mappings with other resources, but the nature of the mapping (is the relation synonymous, hyponymous, or hypernymous?) is implicit and not clear. Either way, the gene ontology bot is currently down until we have found a better solution to handle mappings.
Could you elaborate a bit on the duplicate conflicts you mention?
Thank you for your understanding --Andrawaag (talk) 21:03, 20 November 2019 (UTC)
The duplication is with the respective chemical compound item which has the ChEBI id (unique value constraint violated). --SCIdude (talk) 06:22, 21 November 2019 (UTC)
This is indeed an issue, which we need to discuss as well. It is not the first time that we have run into these "unique value constraint violations". We agree, the addition of the ChEBI mappings is wrong, but there is a more systemic issue that makes this happen: how resources map identifiers. Our bot added this mapping because it actually is in the sourced resource, using a property often also used to map identifiers. Again, we agree that this is wrong and that is why we stopped our bot from running. However, the more resources are covered on Wikidata, the more similar unique value constraint violations will emerge. Two things need fixing: 1. resources should be more explicit on the meaning of their identifier mappings; rdf:seeAlso and owl:dbXRef are just not sufficient. 2. constraint checks should consider references, which they currently don't. One of the beauties of Wikidata is that issues like this emerge, i.e. inconsistencies between resources become apparent and actionable. For this reason, I would actually prefer that we continue adding incorrect mappings to Wikidata if this reflects mappings that exist, provided the source and the mapping property (e.g. dbxref, seealso, skos mapping relations, etc.) are added as qualifiers. In the long run this will benefit the overall quality, but if the constraint checks consider neither references nor qualifiers, this will lead to more constraint violations that can only be fixed by indeed not adding the mappings, making the issue persist, while we should actually welcome them as a means to show disagreement between sources or inconsistencies in how certain assertions are made. --Andrawaag (talk) 09:00, 21 November 2019 (UTC)
I would agree but not in this case (ChEBI Id on a GO item) because a ChEBI entry is about a chemical entity, and if you want to link to that from a GO item, just link to the chemical item. I have done that already for the whole GO, so you don't need to anymore. --SCIdude (talk) 07:20, 11 December 2019 (UTC)
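
As an illustration of the conflict discussed in this thread, a unique value check can be sketched as grouping (item, identifier) pairs by identifier and reporting every identifier that sits on more than one item. This is only a simplified model, not the actual constraint checker; the item and identifier strings in the usage note are placeholders.

```python
from collections import defaultdict


def unique_value_violations(statements):
    """Map each identifier value used on more than one item to those items.

    `statements` is an iterable of (item, value) pairs for one property.
    """
    items_by_value = defaultdict(set)
    for item, value in statements:
        items_by_value[value].add(item)
    return {
        value: sorted(items)
        for value, items in items_by_value.items()
        if len(items) > 1
    }
```

If a GO item and a chemical compound item both carry the same ChEBI id, that id shows up in the result with both items listed, which is the duplicate described above. Linking the GO item to the compound item instead leaves the id on one item only.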

Where do changes come from? (Why a particular change?)

I'm trying to figure out the 'status' of a particular name. Right now the world consensus seems quite split between "Burkitt's lymphoma" and "Burkitt lymphoma". So at some apparently random point the information here changed without any attribution. That doesn't help. Can you speculate _where_ the justification came from? Shenme (talk) 08:12, 16 November 2019 (UTC)

@Shenme: Our bot sources the Disease Ontology. Looking at the primary record on Burkitt lymphoma, it shows that the name without the apostrophe is the primary name, while the others are stored as valid synonyms. You can see that "Burkitt's lymphoma" is also mentioned as an alias. --Andrawaag (talk) 21:21, 20 November 2019 (UTC)

GO "synonym"

The alias field should only contain exact synonyms, I hope we agree. GO's synonym field also contains NARROW, BROAD, and RELATED ones, besides EXACT. Unfortunately the bot adds all of them, and Wikipedia people have added more unrelated stuff. I will now start purging all GO items of any alias that is not an EXACT GO synonym. But I ask you to please STOP re-adding any inexact ones. --SCIdude (talk) 07:15, 11 December 2019 (UTC)

@SCIdude: Good suggestion. I've created a ticket here. Best, Andrew Su (talk) 01:00, 12 December 2019 (UTC)
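
The filter requested here could look roughly like the sketch below, assuming the synonyms are read from OBO-format term stanzas, where each line has the shape `synonym: "text" SCOPE [xrefs]` and the scope is always written out. The function name and the sample lines in the usage note are invented.

```python
import re

# Matches an OBO synonym tag: synonym: "text" SCOPE ...
# Group 1 is the quoted text, group 2 the scope keyword.
SYNONYM_RE = re.compile(
    r'^synonym:\s+"((?:[^"\\]|\\.)*)"\s+(EXACT|BROAD|NARROW|RELATED)\b'
)


def exact_synonyms(stanza_lines):
    """Return only the synonyms whose scope is EXACT, in file order."""
    exact = []
    for line in stanza_lines:
        match = SYNONYM_RE.match(line.strip())
        if match and match.group(2) == "EXACT":
            exact.append(match.group(1))
    return exact
```

Given a stanza with one EXACT, one BROAD and one RELATED synonym, only the EXACT text is returned, so only that one would be written as an alias. Backslash escapes inside the quoted text are kept as they appear in the file.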

Germination Pore - Delete?

I am not sure about the procedure and side effects so I am asking here: can Q22330478 (germination pore) please be deleted as it is a duplicate of Q5550964 (germ pore)? It was copied with ProteinBoxBot from Gene Ontology but the entry there has been made obsolete later; interestingly, the reason for making it obsolete was the existence of the English Wikipedia entry relating to Q5550964. Thank you for the help and patience with an inexperienced user! --Matthias.Wolf (talk) 14:17, 29 December 2019 (UTC)

Thanks. I have made the necessary merge and edits. --SCIdude (talk) 07:33, 30 December 2019 (UTC)

InterPro

Just a heads up, I'm preparing to do the next InterPro update and, after that, will be splitting all InterPro domain items into domain item + associated protein family item, as they are different concepts mixed by InterPro. It would also clarify that proteins *have* domains, and are *part of* domain families. While this seems unnecessary duplication, it allows inclusion of these families in the existing family tree. --SCIdude (talk) 08:31, 8 January 2020 (UTC)

Bot fight

My bot and ProteinBoxBot seem to be edit warring over an item: https://www.wikidata.org/w/index.php?title=Q28371275&action=history. My bot fixes a redirect but then it is reintroduced on each update. --Matěj Suchánek (talk) 15:26, 24 February 2020 (UTC)

@Matěj Suchánek: Thanks for reaching out. Our bot indeed seems to need some updating on the QIDs used. I am a bit surprised, because this suggests that the bot uses hard-coded QIDs, or that the source uses outdated QID mappings. I currently have limited bandwidth, but will fix this this week. I have temporarily paused the bot until I have fixed this. --Andrawaag (talk) 18:30, 24 February 2020 (UTC)
OK. This is the only item I found, so if these updates only happen once a month and you are going to fix it soon, then perhaps it wasn't even necessary. Thanks anyway. --Matěj Suchánek (talk) 09:35, 25 February 2020 (UTC)
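
One way a bot could avoid reintroducing a redirect is to resolve every target QID through a redirect table before writing. The sketch below is not the fix that was actually applied; the table is a plain dict that the caller is assumed to have built, and the QID strings in the usage note are placeholders.

```python
def resolve_redirect(qid, redirects, max_hops=10):
    """Follow redirect entries until a non-redirected QID is reached.

    `redirects` maps a redirected QID to its target. The hop limit
    guards against cyclic or unexpectedly long chains.
    """
    hops = 0
    while qid in redirects:
        if hops >= max_hops:
            raise ValueError(f"redirect chain too long or cyclic at {qid}")
        qid = redirects[qid]
        hops += 1
    return qid
```

With `redirects = {"Q_old": "Q_mid", "Q_mid": "Q_new"}`, resolving `"Q_old"` gives `"Q_new"`, and a QID that is not in the table comes back unchanged. Writing the resolved value means the other bot has nothing left to fix on the next run.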