
Welcome to Wikidata, Charles Matthews!

Wikidata is a free knowledge base that you can edit! It can be read and edited by humans and machines alike and you can go to any item page now and add to this ever-growing database!

Need some help getting started? Here are some pages you can familiarize yourself with:

  • Introduction – An introduction to the project.
  • Wikidata tours – Interactive tutorials to show you how Wikidata works.
  • Community portal – The portal for community members.
  • User options – including the 'Babel' extension, to set your language preferences.
  • Contents – The main help page for editing and using the site.
  • Project chat – Discussions about the project.
  • Tools – A collection of user-developed tools to allow for easier completion of some tasks.

Please remember to sign your messages on talk pages by typing four tildes (~~~~); this will automatically insert your username and the date.

If you have any questions, don't hesitate to ask on Project chat. If you want to try out editing, you can use the sandbox. Once again, welcome, and I hope you quickly feel comfortable here and become an active editor for Wikidata.

Best regards!

Bináris (talk)

Hello there! I am w:User:Charles Matthews. Charles Matthews (talk) 21:05, 16 February 2013 (UTC)Reply

ODNB references

Hi CM, I mentioned back in April at w:Template talk:Cite ODNB/Archive1#Subscription that the references (and archival material) cited by the ODNB can be accessed at the URL http://www.oxforddnb.com/view/references/ODNBid. Reading the blog by Andrew, I was wondering if you have thought of somehow accessing the ODNB references via Wikidata. Solomon7968 (talk) 06:54, 1 December 2014 (UTC)Reply

Yes, I recall the discussion. Andrew and I met the ODNB folk on Tuesday, at a party for the tenth anniversary of the ODNB; and we have cooperation with them on metadata. So that's a line I could take up with Andrew. Charles Matthews (talk) 06:58, 1 December 2014 (UTC)Reply
Great! However, I have noticed that the ODNB doesn't list book reference authors by full name, and the name is often difficult to figure out; see wikisource:User talk:Billinghurst#Full names regarding this, but that's a separate story. On the metadata front, what do they say about your filled-in references idea? Solomon7968 (talk) 07:26, 1 December 2014 (UTC)Reply
Yes, it's not so clear to me what can be done. Charles Matthews (talk) 07:27, 1 December 2014 (UTC)Reply

You are a composer

Hi Charles, I was Listening to Wikidata this morning, and amongst the profusion of bot edits, I saw you were contributing to the medley! Fabian Tompsett (WMUK) (talk) 09:13, 16 January 2015 (UTC)Reply

Germanic umlaut: Gerstäcker, Friedrich

Is this edit an old / known problem? Cheers --Kolja21 (talk) 11:55, 6 April 2015 (UTC)Reply

Yes, there are many problems with special characters in the Appleton's catalog (on the mix'n'match tool). Charles Matthews (talk) 16:59, 6 April 2015 (UTC)Reply

Fellow of the Royal Society ID (P2070)

Fellow of the Royal Society ID (P2070) is ready. --Tobias1984 (talk) 18:00, 14 September 2015 (UTC)Reply

Wrong merges

All three merges of the Thai administrative units you did earlier today I had to revert, because they were wrong - there are same-named units at different administrative levels, which are NOT the same. Please do not merge when you are not sure about it. Ahoerstemeier (talk) 10:40, 15 October 2015 (UTC)Reply

Thank you for the information. Charles Matthews (talk) 10:48, 15 October 2015 (UTC)Reply

Wikidata query

Hi Charles, We met at the recent Wikidata training in London, and I have been inspired to hold a Wikidata event in the coming weeks. The aim will be to improve the DWB Wikidata items using this query. The page suggests that the data should be edited in Autolist 2. Can I just confirm that it is OK for our volunteers to work off the above page?

As for Autolist2: the comment is perhaps a little misleading. Autolist2 is useful for adding a given statement to a whole list of items. It is perfectly fine to go into items and edit them from Autolist1, of course.
For instance, and I did this for the ODNB genders myself, suppose you go through and find all the women, and mark them as female. Then when you run the query again, it will be all male, given that CLAIM[31:5] guards against families and suchlike.
So to finish the job you can run it in Autolist2, and tell it to add "sex or gender = male" to all of them, a big timesaver.
Good luck with it all, and glad to help. Charles Matthews (talk) 11:51, 9 November 2015 (UTC)Reply
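(For reference: the WDQ filter described above, CLAIM[31:5] with no gender yet, can be written today as a SPARQL query along these lines. This is only a sketch, using the ODNB ID property P1415 as the example identifier; swap in whichever identifier the event is working from.)

SELECT ?item ?itemLabel WHERE {
  ?item wdt:P1415 [] ;          # has an ODNB ID (example identifier)
        wdt:P31 wd:Q5 .         # instance of human, i.e. CLAIM[31:5]
  FILTER NOT EXISTS { ?item wdt:P21 [] }    # no sex or gender statement yet
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100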
Great, Thank you Charles. Jason.nlw (talk) 12:43, 9 November 2015 (UTC)Reply

DNB property proposal

If you are in favor of the property, would you support it explicitly? --- Jura 09:34, 17 November 2015 (UTC)Reply

Well, OK. It would help me, but I wasn't sure it was best possible. Charles Matthews (talk) 09:41, 17 November 2015 (UTC)Reply
I suppose it depends what you want to do with it. To just let it rest there, a complex version of P1433 might be the better solution. For any practical uses, personally, I think a separate property works better. --- Jura 11:26, 17 November 2015 (UTC)Reply
I have a general theory that "described by source" will become important for Wikisource. If W is some kind of collective work, such as the DNB, then a separate property for W does make it easy to call up the scope of W, i.e. the items here that form the main subjects of articles in W. The data items attached to the articles here make it possible to add the authors there, and so to reconstruct the author list, and (author, article) pairs of items. These are the major applications I see of Wikidata to collective works on Wikisource. So since the separate property is positive rather than negative for these applications, I have no objection. I did think others might see other issues. Charles Matthews (talk) 12:42, 17 November 2015 (UTC)Reply
I hadn't seen it that way, but it makes sense. You could do the same with two distinct properties, but I see your approach. Given that it's mainly Billinghurst and yourself who work with it, I'd go with the solution you prefer. If you do want to test, maybe a smaller work would be better. --- Jura 15:10, 17 November 2015 (UTC)Reply
For what I am doing now, which is hunting ODNB items which have not yet been tagged (because the initial mix'n'match list wasn't complete), it would be easier to have the value in "described by source" simply the DNB edition; and the Wikisource page could become the reference URL. This has the advantage of finding the DNB items rather explicitly. But then "described by source" still isn't the opposite relation to "main subject". Charles Matthews (talk) 15:17, 17 November 2015 (UTC)Reply
Now that they are trying to make all identifiers into URLs, maybe an approach for Wikisource should be looked into as well. I added a few more lists here. In terms of queries, I think it's equivalent. It's just that you can't run the daily constraint reports on it (compared to a separate property). I try to finish importing main subjects from WS. Once done, maybe we can convince Magnus to do a Mix-and-Match between Wikisource and Wikidata for DNB. --- Jura 15:57, 17 November 2015 (UTC)Reply
We talked about that idea, some time ago. I know there are some easy tasks for main subjects: the cases for DNB01 and DNB12 where there is a link to enWP. Magnus did DNB00 only (20K+) to get it started. Matching is otherwise quite hard work. Thanks for your help. Charles Matthews (talk) 16:05, 17 November 2015 (UTC)Reply

When adding multiple VIAFs

Hi CM. When adding multiple VIAF identifiers, as per Kyrle Bellew (Q5569477), it would be helpful if you could assign one of the VIAFs the preferred rank [the top of the little boxes] (I usually choose the lower number). You will see that I have done it in this case. This will allow WPs/WSs to have the AC templates populated properly rather than confused in their presentation. Note that the same applies for images, where a ranking is also useful. Thanks. Hope that you are well.  — billinghurst sDrewth 01:58, 4 January 2016 (UTC)Reply

Thanks, I'll bear it in mind. Just about getting over Xmas here. Charles Matthews (talk) 08:07, 4 January 2016 (UTC)Reply

Contributor addition, it is meant to flow from a work listing the contributors

Hi CM. I have previously done some of the contributor = DNB as you did at [1]. I was told that I had it arse about. That addition could only be done to the DNB item itself, where we would list the contributors; and do we do that per volume, or per series? There is currently no reverse property to tie a person to where they contributed, and I gave up and didn't propose it. Let us not talk about the difficulties of listing all contributors to Scientific American.  — billinghurst sDrewth 07:01, 14 July 2016 (UTC)Reply

OK, this was driven by discussion at w:Wikipedia talk:WikiProject Dictionary of National Biography. It seemed a reasonable small project to me, and has proved useful so far, as I said there. In fact if you look at w:List of contributors to the Dictionary of National Biography, you can see we could have a better Listeria-generated page now. And also see w:User talk:Rich Farmbrough#List of contributors to the Dictionary of National Biography. I would like to treat this on the basis that Wikidata guidelines are not yet codified. Charles Matthews (talk) 07:10, 14 July 2016 (UTC)Reply

Property proposals

Please note [2] - your edit broke an existing proposal; please paste your code into a new page for each proposal (so that they are on your watchlist). Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 16:47, 25 July 2016 (UTC)Reply

Edward Nicholas birth and death dates

In recent edits you assert that the birth and death dates for Edward Nicholas are stated in the Oxford Dictionary of National Biography, and that the dates are in the Gregorian calendar. I find this highly unlikely. The United Kingdom did not adopt the Gregorian calendar until 1752, more specifically, Wednesday 2 September 1752 was followed by Thursday 14 September 1752. Articles about the United Kingdom, especially if published in the United Kingdom, normally use the calendar that was in force at the time of the event, so it is highly probable that the dates of Edward Nicholas' birth and death are stated in the Julian calendar in the Oxford Dictionary of National Biography. Please check your information. Jc3s5h (talk) 16:55, 3 September 2016 (UTC)Reply

Yes, Julian is likely. I have not documented the ODNB's editorial policy on dates: it is probable that the default is Julian to 1752, as you suggest. The default here, unfortunately, is Gregorian flagged up, which I don't like. Charles Matthews (talk) 16:58, 3 September 2016 (UTC)Reply
I suggest you determine the ODNB's policy. If it is not explicitly stated, you could compare birth dates in the suspect range with another source, such as American National Biography, which explicitly states on pages xxi to xxii that they use Julian when that calendar was in force, but treat January 1 as the beginning of the year even when England and Wales treated March 25 as the beginning of the year. It is your duty as an editor to either determine the correct information, or undo all your edits in the suspect range. Although Gregorian is the default calendar, the user interface allows the default to be overridden manually. Jc3s5h (talk) 18:54, 3 September 2016 (UTC)Reply
I am in a position to enquire of the ODNB editorial staff what the exact position is, as you suggest, and as I had in mind: as far as I can see it is not in the Help page they offer, nor is their convention on the Old Style year start. I assume their authors are asked to conform to a house style, but that is only my assumption. I am aware of the override. (I think you can omit telling volunteers their duty, as a matter of wiki etiquette.) Charles Matthews (talk) 19:08, 3 September 2016 (UTC)Reply
So I have been sent the relevant section of the style manual for the ODNB. It does support the idea that pre-1752 dates are Julian, post-1752 Gregorian. There is a caveat about dates given for non-British events, which may use the local calendar.
This then leaves a maintenance problem. My idea would be to master how to use a SPARQL query to pull out dates marked "Gregorian", and apply it with some side conditions. I'll ask at Project Chat. Charles Matthews (talk) 09:23, 14 September 2016 (UTC)Reply
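(A sketch of the kind of query meant here, assuming the standard WDQS date modelling, where Q1985727 is the proleptic Gregorian calendar: it lists ODNB-linked people whose date of birth is stored with the Gregorian calendar model but falls before the 1752 changeover.)

SELECT ?item ?itemLabel ?dob WHERE {
  ?item wdt:P1415 [] ;                # has an ODNB ID
        p:P569/psv:P569 ?dobNode .    # full value node for date of birth
  ?dobNode wikibase:timeValue ?dob ;
           wikibase:timeCalendarModel wd:Q1985727 .   # marked as Gregorian
  FILTER(YEAR(?dob) < 1752)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100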

Tidying the ODNB import

Hi Charles,

I've been wondering for a while how best to tidy up the ODNB imports which have the "parent item" description attached to the "child item" labels. Two queries that seem useful:

  • SELECT DISTINCT ?description1 ?item1 ?item1Label ?item2 ?item2Label
    {
      {
        SELECT DISTINCT ?item1 ?description1 ?item2
        {
          ?item1 wdt:P1415 ?whatever1 .
          ?item2 wdt:P1415 ?whatever2 .
          ?item1 schema:description ?description1 .
          ?item2 schema:description ?description1 .
          FILTER(LANG(?description1) = "en" && ?item1 != ?item2 && str(?item1) < str(?item2)) .
          FILTER(CONTAINS(str(?description1), '('))
        }
        LIMIT 1000
      }
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }

    Try it!
This one gets all items which have identical descriptions *and* whose identical descriptions contain a bracket; this catches most of the cases where two paired items were imported at the same time. Some related queries to find entries which still use the ODNB summary:
  • SELECT DISTINCT ?description1 ?item1 ?item1Label ?item2 ?item2Label
    {
      {
        SELECT DISTINCT ?item1 ?description1 ?item2
        {
          ?item1 wdt:P1415 ?whatever1 .
          ?item1 schema:description ?description1 .
          FILTER(CONTAINS(str(?description1), '<'))
        }
        LIMIT 10000
      }
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }

    Try it!
This gets all the ones with garbled HTML in the import (there's a couple of hundred).
  • SELECT DISTINCT ?description1 ?item1 ?item1Label ?item2 ?item2Label
    {
      {
        SELECT DISTINCT ?item1 ?description1 ?item2
        {
          ?item1 wdt:P1415 ?whatever1 .
          ?item1 schema:description ?description1 .
          FILTER(CONTAINS(str(?description1), '['))
        }
        LIMIT 10000
      }
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }

    Try it!
All the ones with square brackets (usually only found in ODNB style, not ours).
  • SELECT DISTINCT ?description1 ?item1 ?item1Label ?item2 ?item2Label
    {
      {
        SELECT DISTINCT ?item1 ?description1 ?item2
        {
          ?item1 wdt:P1415 ?whatever1 .
          ?item1 schema:description ?description1 .
          FILTER(CONTAINS(str(?description1), '),'))
        }
        LIMIT 10000
      }
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }

    Try it!
Any with the bracketed dates followed by a comma and other text - usually a sign it's the ODNB description.
  • SELECT DISTINCT ?description1 ?item1 ?item1Label ?item2 ?item2Label
    {
      {
        SELECT DISTINCT ?item1 ?description1 ?item2
        {
          ?item1 wdt:P1415 ?whatever1 .
          ?item1 schema:description ?description1 .
          FILTER(CONTAINS(str(?description1), '–'))
        }
        LIMIT 10000
      }
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }

    Try it!
Any with a long dash rather than a hyphen (again, likely from ODNB).
Not sure how useful these are likely to be to you, but thought they might well be of interest to find the ones most likely to need maintenance. Andrew Gray (talk) 12:56, 27 May 2017 (UTC)Reply

Thanks - certainly going to be useful. Now to try to find some time ... Charles Matthews (talk) 20:14, 29 May 2017 (UTC)Reply

@Andrew Gray:: important second thought. Among these pairs are going to be some from a bot batch where, erroneously, incorrect dates were added. On my conscience is the need to go over that whole batch again. If further SPARQL magic could fish out candidates, say by using P1415 and exact matches of birth and death years, perhaps that check could be made with less drudgery. Charles Matthews (talk) 20:50, 31 May 2017 (UTC)Reply

Hmmm - interesting. All ODNB items where birth1 = birth2 and death1=death2? One complication here is that a lot of items will already have dates imported from enwiki (often wrong, via old DNB), or manually specified to day level rather than year, which will throw it out. But I'll see if I can work out something... Andrew Gray (talk) 10:29, 3 June 2017 (UTC)Reply
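(One possible shape for such a query, as a sketch only: to keep the join tractable it also requires the two items to share an English label, since pairing every ODNB item against every other on years alone would very likely time out; and, as noted above, manual checking of the dates would still be needed.)

SELECT ?item1 ?item2 ?name WHERE {
  ?item1 wdt:P1415 [] ; rdfs:label ?name ; wdt:P569 ?b1 ; wdt:P570 ?d1 .
  ?item2 wdt:P1415 [] ; rdfs:label ?name ; wdt:P569 ?b2 ; wdt:P570 ?d2 .
  FILTER(LANG(?name) = "en")
  FILTER(STR(?item1) < STR(?item2))                    # report each pair once
  FILTER(YEAR(?b1) = YEAR(?b2) && YEAR(?d1) = YEAR(?d2))
}
LIMIT 500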

Not forgotten about these. They have been copied in to User:Charles Matthews/Queries, which I suppose one day will grow into training material. Charles Matthews (talk) 09:36, 13 June 2017 (UTC)Reply

Wikidata isn't really a triple store

In http://moore.libraries.cam.ac.uk/meet-your-wikimedian-residence/extract-transform-load you seem to write that Wikidata is a triple store. I think that's misleading. Wikidata isn't focused on 3D reality. Good Wikidata claims have references, which makes them 5D. Very often they also have qualifiers that add additional dimensions. If qualifiers are too complicated for WikiFactMine that's okay, but you are still not left with triples but with 5D entities that include sources.

Apart from that it feels strange to write blog posts without an ability for readers to leave comments when your goal is community outreach. ChristianKl (talk) 14:53, 14 June 2017 (UTC)Reply

@ChristianKl: Thanks for the comments. I'm discussing feedback with the library webmaster.
As to "Wikidata is a triple store", that is obviously not the whole truth, but it is also not untrue: it stores triples. It is not a question of WikiFactMine, but of the intended audience, which is largely librarians rather than tech people. I could also talk about S(⌊R(x,y)⌋, z) being the way a quintuple with properties R and S, S qualifying the statement "R(x,y)", resolves into two triples, but that would be more pleasing to logicians.

Charles Matthews (talk) 15:19, 14 June 2017 (UTC)Reply
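(To illustrate the point about triples: a sketch of how one qualified, referenced Wikidata claim is exposed as several plain triples in the query service's RDF model, using the familiar Douglas Adams (Q42) "educated at" example; the particular item and properties are incidental.)

SELECT ?statement ?school ?endTime ?source WHERE {
  wd:Q42 p:P69 ?statement .                     # claim -> statement node
  ?statement ps:P69 ?school .                   # the main value
  OPTIONAL { ?statement pq:P582 ?endTime . }    # a qualifier: end time
  OPTIONAL { ?statement prov:wasDerivedFrom/pr:P248 ?source . }   # a reference: stated in
}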


blogs without comments are a commonplace now, given the headache of monitoring the rampant incivility. you should really change the thumbnail image. the 15th birthday party one is bigger and more recent. cheers. Slowking4 (talk) 19:13, 15 June 2017 (UTC)Reply
  • @Charles Matthews: My argument isn't about technology but about ontology. I do think librarians care about ontology and how data gets modeled. A bit more than a decade ago people like Barry Smith came to the conclusion that knowledge isn't just made up of triples. Barry Smith wrote papers like Against Fantology and formulated Basic Formal Ontology, the paradigm of 4D perspectives on reality. Ontologies like the Ontology for Biomedical Investigations are based on Basic Formal Ontology.
Wikidata's qualifiers allow it to express 4D perspectives, and it's worthwhile for a librarian who wants to understand Wikidata to understand that capability.
Aside from the 3D/4D distinction of Barry Smith there's also the issue of sources. In Wikidata we don't want to store "X has relationship R to Y" but "Book B said 'X has relationship R to Y'".
Our data model is quite different from the triples that the Integrated Authority File of the German national library uses. And I haven't even talked about ranks which add another dimension to the data. ChristianKl (talk) 22:17, 19 June 2017 (UTC)Reply
Thank you for the detailed comments, which I'll try to digest. Charles Matthews (talk) 04:04, 20 June 2017 (UTC)Reply

On the comments: there is now an email link for me on the blog page at the Moore Library. More recent is better!? Charles Matthews (talk) 03:18, 16 June 2017 (UTC)Reply

future scholars will want to track the graying of the beard; need to be "state of the art"; the WMUK EDU one is good also if artistically cropped. Slowking4 (talk) 01:48, 18 June 2017 (UTC)Reply

Wow, the needs of future pogonologists! You're right that I wasn't taking those into account. Charles Matthews (talk) 03:21, 19 June 2017 (UTC)Reply

fyi, there is talk of a GLAM user group https://meta.wikimedia.org/wiki/Wikimedia_GLAM_User_Group ; you might want to write up (copy paste) your exploits at GLAM newsletter https://outreach.wikimedia.org/wiki/GLAM/Newsletter . Slowking4 (talk) 13:47, 23 June 2017 (UTC)Reply

Thanks, useful. Charles Matthews (talk) 09:24, 24 June 2017 (UTC)Reply

Donald Trump pseudonyms

Greetings, Charles M., from Deborahjay, "the small and meek" but nevertheless recently bold on my reading of w:Donald Trump pseudonyms on which Item pseudonyms of Donald Trump (Q26869209) is based, not being a Wikimedia list page but rather an actual article. I'm particularly unclear on the Statements, which I regret to have left in something of a muddle. Your advice is solicited at Talk:Q26869209. -- Many thanks, Deborahjay (talk) 08:50, 28 July 2017 (UTC)Reply

Hi - I have contributed over there. Charles Matthews (talk) 09:35, 28 July 2017 (UTC)Reply

Pietro Chiesa

You seem to have made some mix ups in Pietro Chiesa (Q38019800) and Pietro Chiesa (Q38020071). You might want to recheck the items and links. Multichill (talk) 12:55, 7 September 2017 (UTC)Reply

OK, I'll have a look. Charles Matthews (talk) 12:57, 7 September 2017 (UTC)Reply
 Y Did a mix'n'match check today. Charles Matthews (talk) 09:54, 25 May 2018 (UTC)Reply

Varvitsiotis' image

Hello! (Hoping this is the right place for such a request; if not, I'm sorry for the trouble.) I've noticed that you undid some changes to Miltiádis Varvitsiótis (Q12881046) re. the image. There seems to be some confusion here. The image added clearly belongs to Varvitsiotis' grandson (Miltiades Varvitsiotis (Q12881047)), with whom he shares the same name. Is it possible to somehow flag either entry or both so bots won't add the wrong image? --cubic[*]star 20:56, 25 October 2017 (UTC)Reply

OK, sorry, that was careless of me. I'm done with politicians now. Charles Matthews (talk) 20:58, 25 October 2017 (UTC)Reply
@CubicStar: It looks like the grandfather's entry had an interwiki & commons category for the grandson, which might be why the image got added. I've sorted them out so hopefully it won't reappear. Andrew Gray (talk) 11:55, 26 October 2017 (UTC)Reply
@Charles Matthews: Thank you! --cubic[*]star 16:24, 26 October 2017 (UTC)Reply

2x Thomas Scott

Can you please take a look at Thomas Scott (Q19363661) (1780-1835). There is no reliable source for this item.

I've tried to separate both items, but gave up. --Kolja21 (talk) 17:17, 12 November 2017 (UTC)Reply

101024920 is the correct OBIN for Thomas Scott (1780–1835). As it says in the article, "Thomas Scott (1780–1835), the fourth son, born on 9 November 1780, was educated at Queens' College, Cambridge, graduating BA in 1805 and MA in 1808." It is a subarticle, and Thomas Scott (1780–1835) is (correctly) a co-subject in the article about his father Scott, Thomas (1747–1821), "Church of England clergyman and biblical scholar", who has OBIN 101024919. It is confusing here because the father and son are both called Thomas. Charles Matthews (talk) 17:33, 12 November 2017 (UTC)Reply
Thanks for the fast reply. So I need a subscription to see this info? --Kolja21 (talk) 17:41, 12 November 2017 (UTC)Reply
Yes, the ODNB site is behind a paywall. In the UK, one can read it with a library card. But now there are the Cambridge Alumni Database ID (P1599) and Clergy of the Church of England database ID (P3410) identifiers on the item, and they give enough identifying information. Charles Matthews (talk) 17:49, 12 November 2017 (UTC)Reply

DNB/ODNB matches

Hi Charles,

Noticed a couple of duplicated items today where one has an ODNB ID and the other had a DNB link, and for some reason we never matched them up. I've knocked up a quick query for any items with a DNB "described by source" but no ODNB entry:

SELECT ?item ?itemLabel ?instanceLabel
WHERE
{ 
  { ?item wdt:P1343 wd:Q16014700 . }
  UNION { ?item wdt:P1343 wd:Q15987216 . }
  UNION { ?item wdt:P1343 wd:Q16014697 . } # is described by any DNB volume
  ?item wdt:P31 ?instance
  FILTER NOT EXISTS { ?item wdt:P1415 ?odnb . } 
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
Try it!

Might be of some interest to you! Andrew Gray (talk) 10:52, 18 November 2017 (UTC)Reply

@Andrew Gray: Thanks, that's a very interesting cleanup list. By the way, I'm now of the view that section, verse, paragraph, or clause (P958) should be used to qualify, rather than stated in (P248), when giving Dictionary of National Biography, 1885–1900 (Q15987216) etc. followed by the page for the DNB article. Charles Matthews (talk) 11:42, 18 November 2017 (UTC)Reply

Simon François Ravenet

You seem to have mixed up Simon François Ravenet (Q1771158) and Simon Jean François Ravenet (Q24685816). Can you check the authority control links you added? At least two of them were wrong and are on the other person. Multichill (talk) 21:40, 18 January 2018 (UTC)Reply

 Y I have done a check on mix'n'match today. Charles Matthews (talk) 10:00, 25 May 2018 (UTC)Reply

William Henry Bradley - a mixup of 2 men...

Hi,

I divided William Henry Bradley (Q46690366), where you added 2 IDs, of which one was for someone else. You can find the other person at William Henry Bradley (Q47484762). --Hsarrazin (talk) 20:05, 22 January 2018 (UTC)Reply

Henry Garnett Venn - Henry Garnett Venn (Q27868514)

Who is this person who, according to the source, died aged 3 days... and has no link to any other item? What makes this item interesting? :/ --Hsarrazin (talk) 21:32, 23 January 2018 (UTC)Reply

The Kindred Britain link says his father is Henry Straith Venn (Q42245867). But I agree that such a short life is not interesting, usually. I did some checking, because the death date given for his mother, Maria Garnett, is inconsistent with the life of the son. It turns out that Kindred Britain is reliable on relationships; but there are some incorrect dates.
In this case, the death date of Maria Garnett is wrong: she lived to 1960.[3] And the 1908 life of Henry Garnett Venn is correct.[4] Thank you for pointing out this issue.
Charles Matthews (talk) 21:53, 23 January 2018 (UTC)Reply
Thanks.
I don't know Kindred Britain. What's the content/interest of this database? Are the people in it notable for what they did, or for who they are (nobility)? --Hsarrazin (talk) 08:36, 24 January 2018 (UTC)Reply
It's from Stanford University. It began with a study of the background of W. H. Auden (Q178698) the poet. It has grown to about 30K entries, illustrating the networks associated with British literature. Charles Matthews (talk) 08:44, 24 January 2018 (UTC)Reply
Ah, interesting! So it's only about links between people, yes? Have a nice day! --Hsarrazin (talk) 08:55, 24 January 2018 (UTC)Reply

Wikidata:Property_proposal/broader_concept

Pinging you (amongst others) since I suspect this might be relevant to the ContentMine dictionary, as to whether or not you think it would be useful to be able to record the "broader" field in thesauruses that have one, allowing one to reference the thesaurus structure in WDQS queries. Property proposal at Wikidata:Property_proposal/broader_concept. Jheald (talk) 19:26, 10 February 2018 (UTC)Reply

@Jheald: I'm not getting this immediately - need to think some more. Most dictionary terms in the ContentMine sense are thing-like and concrete (diseases, drugs, genes ...) Charles Matthews (talk)
Ah, okay. I wasn't sure what degree of hierarchical organisation there was in the CM dictionary, nor how closely it corresponded to hierarchical statements on Wikidata.
For a different example, consider eg the Getty Art & Architecture Thesaurus (Q611299), as referenced by property Art & Architecture Thesaurus ID (P1014). The first four levels can be seen at User:Jheald/aat, or the full list of terms at User:Jheald/aat/full.
The question is whether, for ingestion / confirmation / quality control / sourcing / reference / comparison / extraction, it would be useful to have a property to record the hierarchical structure manifested in the external source, as distinct from our hierarchical structure that is represented by subclass of (P279), instance of (P31), facet of (P1269) etc. Jheald (talk) 18:11, 12 February 2018 (UTC)Reply
There is a whole can of worms in mereology (Q1194916). If there is a semantic web version of part-whole that seems viable here, then it could be of interest. Charles Matthews (talk) 10:08, 13 February 2018 (UTC)Reply

HoP in mix-and-match

Just noticed this edit, which is worrying me a little - in theory there shouldn't be any entries for the post-1690 volumes coming from mix-and-match, as I've done them all, so they should have been picked up by the automatching. It looks like the URLs have been garbled a bit when imported, and so they're not picking up as identical. Do you know if we can do anything about this, or whether we need to ask Magnus to rebuild the list using non-URLencoded forms?

It seems we now have about seventy like this (the Q312591 entry is a known duplicate, the others are all newish). I won't try and remove them just yet since it will only cause problems, but it's something we do need to work out. Andrew Gray (talk) 17:14, 3 April 2018 (UTC)Reply

@Andrew Gray: Yes, it started to worry me too when I had a closer look. I thought I was adding a small percentage of missing ones from the far end of the unmatched set on mix'n'match. Some at least are trivial URL variants. I may have to revert them all, which is a bit frustrating.
There will be some better matches to make in that set. It is anyway overwhelmed with soft redirects which are clearly irrelevant. I was avoiding those. The sort of thing that has happened is substitution for parentheses in the URL. That anyway is clear enough in the identifier. Feel free to revert those cases. I'll have a look now. Charles Matthews (talk) 17:44, 3 April 2018 (UTC)Reply
The mix-and-match list has 22634 entries including redirects; there's a "redirect-free" set of URLs here which has 21404. However, they use the percent-encoded format (the same as the one currently on mix-and-match). The ones on Wikidata at the moment use unencoded forms for ( ) ’ - I think those are the only special characters, but there might be a few accented letters.
A list of all missing IDs as of last night is at [5] - there's about 6000, which sounds about right if M&M thinks there's 7000 and it's got all the redirects. Andrew Gray (talk) 18:07, 3 April 2018 (UTC)Reply
Should be cleaned up now, and the query reflects that. I have copied you into a mail about it all. Charles Matthews (talk) 18:33, 3 April 2018 (UTC)Reply

Duplicates

  Hello, I'm Marsupium. Before creating an item, please make sure it doesn't already exist. I've merged Walter Beck (Q38074585) with Walter Beck (Q38074573). If you have any questions, you can leave me a message on my talk page. Thanks! --Marsupium (talk) 07:24, 7 June 2018 (UTC)Reply

Henry Kingsbury

Can you explain these edits? Henry Kingsbury (Q5724354) is obviously a different person than Henry Kingsbury (Q53508607). Could you please clean this up? Also in mix'n'match. Multichill (talk) 19:42, 28 June 2018 (UTC)Reply

@Multichill: A mistake on my part. Thanks for pointing this out. But there was nothing to fix on mix'n'match. Looking closely at the histories, I see that I matched to the wrong item, undid the matches, and matched to the correct item, within a couple of minutes. Normally mix'n'match would handle the undoes, I think. Under some conditions it might not.
In any case, I have removed the incorrect statements from Henry Kingsbury (Q5724354). My understanding is that this was an anomaly.
Charles Matthews (talk) 19:45, 28 June 2018 (UTC)Reply
Mix'n'match undoes nothing on wikidata, it just unmixes. One needs to separately reverse those entries. Been there, and fouled that up myself in my earlier days with the tool.  — billinghurst sDrewth 07:38, 29 June 2018 (UTC)Reply

GTAA on Wikidata

You receive this message because you previously matched persons from the GTAA (the Thesaurus for Audiovisual Archives) with items on Wikidata. We would like to inform you about some improvements that we have made to that catalogue on Mix’n’Match. We have improved the automatic links and added additional information from our catalogue (what we’ve called ‘extracted terms’) to the terms. We hope that this makes matching the thesaurus with Wikidata that much more fun and easier. Read more about this project here or in Dutch WikiProject Dutch Media History nl here. Best! 85jesse (talk) 08:05, 18 July 2018 (UTC)Reply

Find multiple

#Prototype focus list batch by Aleksey
SELECT ?item ?itemLabel 
WHERE 
{
  values ?doi {  "10.1186/1743-422X-7-45" "10.3748/WJG.V13.I1.48" "10.1186/1743-422x-7-45" "10.3748/wjg.v13.i1.48" }
  ?item wdt:P356 ?doi
  
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
Try it!

health specialty (P1995)

Hi. Thank you for adding many health specialty (P1995) statements to disease items. But some of them are invalid. For instance, musculoskeletal disorder (Q4116663), urinary system disease (Q7900883), parasitic infectious diseases (Q1601794), cardiovascular disease (Q389735), etc. are not medical specialties, but diseases. These items cannot be values of health specialty (P1995). Could you please fix them? Regards, --Okkn (talk) 14:09, 9 August 2018 (UTC)Reply

Thanks for the feedback. The content of the statements is derived from the MeSH tree code (P672) statements that have been added to the disease items. Charles Matthews (talk) 14:22, 9 August 2018 (UTC)Reply
What do you mean by “derived from the MeSH tree code (P672)”? How did you determine the health specialties from MeSH and reconcile them with Wikidata entities?
What I want to say is that health specialty (P1995) in premature ventricular contraction (Q26781137), which you have added, for example, should be cardiology (Q10379), not cardiovascular disease (Q389735). --Okkn (talk) 14:38, 9 August 2018 (UTC)Reply
With a colleague I'm working on the ScienceSource project (see e.g. WD:SSFL) and as a preliminary we have done a conversion from MeSH descriptor ID (P486) to MeSH tree code (P672). The advantage is that MeSH tree code (P672) gives well-structured identifiers. I apologise if my first attempt to exploit the information is not so appropriate. I was last working in this area over a year ago, and I have the impression that there have been some changes made since then. Charles Matthews (talk) 14:46, 9 August 2018 (UTC)Reply
I know the tree structure of MeSH, and your strategy using MeSH codes is right. The trouble is that you didn't distinguish medical specialty (Q930752) from disease (Q12136). Do you understand the difference between urology (Q105650) and urinary system disease (Q7900883)? --Okkn (talk) 14:59, 9 August 2018 (UTC)Reply
Yes, I understand that. I find it confusing that the English description for urology (Q105650) mentions only surgery, but reading the English Wikipedia article about the "branch of medicine that focuses on surgical and medical diseases of the male and female urinary-tract system and the male reproductive organs" is clearer. Thank you for understanding what I'm trying to do. urinary system disease (Q7900883) for me, this afternoon, was being used as a placeholder, really. I'm happy to make any replacements, cardiology (Q10379) for cardiovascular disease (Q389735), urology (Q105650) for urinary system disease (Q7900883), and so on.
To go into more detail, I'm working with the query at Wikidata talk:WikiFactMine/Core SPARQL#MeSH Code tree handling for "health speciality". That is a case analysis into nearly 30 cases: I can see I have started off in a clumsy way, but I will fix everything, of course. The query succeeds in extracting information from MeSH Code. Probably it can be refined, and certainly the handling of the cases should be changed. In 2017 I did similar things for ICD-9 and ICD-10 (and now ICD-11 is possible, I think) which was harder to code, but the specialty side was much more obvious.
So I have work to do here, and I'm grateful for the pointers you have given me. Charles Matthews (talk) 16:48, 9 August 2018 (UTC)Reply
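(As a concrete illustration of that tree-code case analysis, a minimal sketch, assuming the C14 branch of the MeSH tree is the cardiovascular one: it lists items whose MeSH tree code sits in that branch and which lack a health specialty statement, i.e. candidates for health specialty (P1995) = cardiology (Q10379).)

SELECT ?item ?itemLabel ?tree WHERE {
  ?item wdt:P672 ?tree .                      # MeSH tree code
  FILTER(STRSTARTS(?tree, "C14"))             # assumed cardiovascular branch
  FILTER NOT EXISTS { ?item wdt:P1995 [] }    # no health specialty yet
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100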

Can your project pull this information?

I am still thinking about what your project may or may not do.

Is identifying "risk factor" (see Wikidata:Property proposal/risk factor) as a term in papers, and linking it to structured Wikidata about which risk factors a paper identifies, among the possible outcomes of what your project might accomplish? Blue Rasberry (talk) 13:26, 15 August 2018 (UTC)Reply

@Bluerasberry: Yes, in principle, it could do. What we'd usually be doing is searching for statements equivalent to "X is a risk factor for Y" where X runs over some definite list of "risky stuff" and Y runs over a list of, say, "diseases". The community discussion is clearly wide-ranging, but this could be a use case.
We certainly have such lists, under the name "dictionaries", for diseases, for example a list of cancers. It would be possible, with glyphosate (Q407232) in mind, to create a list for X of herbicides, or some such list of chemicals. Then what the tech would do is to find, in a bunch of papers, places where an X term and a Y term are close in the text.
The human-assisted step is then to have a person actually check the language: what does it assert, if anything, about an association between the herbicide and the cancer? The conclusion would take the form of an annotation. If the "risk factor" property existed, then with caveats about when it should be applied, the annotation could note the presence of a candidate statement for Wikidata. The source would still have to pass (our version of) MEDRS to be written into Wikidata as a referenced statement.
So, text-mining with definite dictionaries; human fact-check; scrutiny of the source; and the statement is passed over here if all is well. It is a standard workflow, but clearly only as good as the various inputs. Charles Matthews (talk) 13:44, 15 August 2018 (UTC)Reply
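(A minimal sketch of what one such "dictionary" looks like when pulled from Wikidata, assuming Q12078 as the cancer class; the X list for herbicides would be built the same way from whatever class item is chosen.)

SELECT ?cancer ?cancerLabel WHERE {
  ?cancer wdt:P279* wd:Q12078 .   # subclass of cancer, at any depth
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 500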

PubMed article license task from Cambridge event

I've uploaded the script I used for generating the QuickStatements commands on Saturday to https://github.com/tmtmtmtm/sciencesource-pmc-licenses --Oravrattas (talk) 21:42, 22 October 2018 (UTC)Reply

Many thanks! I have captured the edits on User:Charles Matthews/ScienceSource. Charles Matthews (talk) 03:34, 23 October 2018 (UTC)Reply

OpenRefine tutorials

Hi!

It was great catching up with you at the meetup! If you want to give OpenRefine a try, we have loads of tutorials to get started:

I hope you enjoy! If anything is unclear I would love to know the areas where these materials can be improved. − Pintoch (talk) 00:10, 8 November 2018 (UTC)Reply

Thanks. I enjoyed talking to you, too. Charles Matthews (talk) 06:45, 8 November 2018 (UTC)Reply
@Pintoch: You asked about the focus list. I was writing some documentation about it just now: http://sciencesource.wmflabs.org/wiki/Focus_list_and_filtering_it . Charles Matthews (talk) 11:16, 8 November 2018 (UTC)Reply
Great, thanks! Would you also have an example of the PMC topics we were talking about? − Pintoch (talk) 11:53, 8 November 2018 (UTC)Reply
@Pintoch: For topics it was PubMed — PMC was for licenses! Here's a page linked from Eight blue babies. (Q35103130): https://www.ncbi.nlm.nih.gov/pubmed/?term=12685296. Expanding the + section, the topic is definitely methemoglobinemia (Q748442). When you click on the link with text Methemoglobinemia/diagnosis, and choose the option "search in MeSH", you get https://www.ncbi.nlm.nih.gov/mesh?term=%22Methemoglobinemia%22. OK, the top hit is then the page https://www.ncbi.nlm.nih.gov/mesh/68008708, where we have
Tree Number(s): C15.378.619 MeSH Unique ID: D008708
and the problem is solved. In https://tools.wmflabs.org/wikidata-todo/resolver.php one can set P486 for MeSH ID and value D008708, and the result is Q748442 (or use SPARQL directly, who cares). Here the label confirms that the top hit is OK. For diseases the MeSH catalog on mix'n'match was completed this summer, so this works generally. (For other kinds of topics the matching would need to happen.)
It is just strange. Obviously "Methemoglobinemia" could be extracted from the PubMed page. MeSH has a SPARQL endpoint at https://id.nlm.nih.gov/mesh/query that could be useful. But I don't see quite how to avoid the search step and some kind of checking inference back from the MeSH information. Charles Matthews (talk) 12:39, 8 November 2018 (UTC)Reply
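(The SPARQL version of that last resolution step is a one-liner; a sketch, with D008708 as the worked example.)

SELECT ?item ?itemLabel WHERE {
  ?item wdt:P486 "D008708" .      # MeSH descriptor ID for Methemoglobinemia
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}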
Oh I see, okay, so PubMed is simply not aligned to MESH, it only has topic strings which can be searched in MESH. − Pintoch (talk) 14:25, 8 November 2018 (UTC)Reply
Actually another way to do it is the other way round: start with a MeSH term and search PubMed: e.g. https://www.ncbi.nlm.nih.gov/pubmed/?term=%22Methemoglobinemia%22%5BMeSH+Terms%5D . That would be easy scraping. If we did enough of that, then the "alignment" would exist on Wikidata ... Charles Matthews (talk) 14:38, 8 November 2018 (UTC)Reply

Samuel Heathcote

There does seem to be an artist Samuel Heathcote - Royal Collection has the dates 1656-1708 but perhaps two men have been conflated? Will do some research. - PKM (talk) 21:46, 9 November 2018 (UTC)Reply

Haven't found anything on the painter so far except the Royal Collection. I expect their dates are wrong, but I have made a separate item and moved the painting links there. - PKM (talk) 23:06, 9 November 2018 (UTC)Reply
In the matter of the Raphael derived work, I assumed Heathcote owned it. He was very wealthy[6]. Charles Matthews (talk) 06:08, 10 November 2018 (UTC)Reply
That seems very likely. - PKM (talk) 20:04, 10 November 2018 (UTC)Reply

Is John Dixon (Q30020905) a fictional human (Q15632617)?

Hello, you've created the statement John Dixon (Q30020905)instance of (P31)fictional human (Q15632617). Where does it say that in s:en:Dixon, John (d.1715) (DNB00)? Thanks a lot in advance! Best, --Marsupium (talk) 01:23, 18 March 2019 (UTC)Reply

@Marsupium: On s:Dixon, John (d.1715) (DNB00) there is a note. It cites the ODNB article "Dixon, Matthew", and there it states "In her article 'Nicholas Dixon, limner, and Matthew Dixon, painter, died 1710', Mary Edmond noted that 'Vertue's “Mr John Dixon” was … non-existent', and that 'Walpole added to the confusion by combining details given by Vertue about Nicholas Dixon the limner with those about “John” Dixon, and attributing them to the latter'." That is citing Mary Edmond, Nicholas Dixon, Limner: And Matthew Dixon, Painter, Died 1710, The Burlington Magazine Vol. 125, No. 967 (Oct., 1983), pp. 610-612 (3 pages), Published by: Burlington Magazine Publications Ltd., https://www.jstor.org/stable/881428. Charles Matthews (talk) 04:17, 18 March 2019 (UTC)Reply

Health specialty

Hi Charles, With regard to the recent changes cancelling health specialization, I remind you that there was a discussion in the past (https://www.wikidata.org/wiki/Property_talk:P1995). In that discussion it was decided to change the specialization from medical to health, so as to be able to include psychology (clinical psychology, in particular). I hope everything is clear; see you soon. --Dapifer (talk) 11:58, 20 March 2019 (UTC)Reply

Query service lag

Hey, we're currently experiencing some lag on the query service. One thing that has much impact on this is edits done on fairly large items. Could you postpone your current task for a moment to give the query service some time to recover? Thanks! Sjoerd de Bruin (talk) 16:00, 24 April 2019 (UTC)Reply

Not straightforward. Charles Matthews (talk) 16:11, 24 April 2019 (UTC)Reply
Or restart them tomorrow? --Egon Willighagen (talk) 16:26, 24 April 2019 (UTC)Reply
Do I understand correctly that you're using QuickStatements 1 and not 2? I can recommend using the new QuickStatements (which accepts the old format) next time, because that one allows you to pause jobs.--Egon Willighagen (talk) 16:29, 24 April 2019 (UTC)Reply
To explain: I do understand the issue here. I'm working on a project with a deadline, set by the WMF grant, which comes at the end of May. The processing I need to do comes in several stages: I need to move content in batches twice through QuickStatements. The batch currently running through happens to be the final one in one segment of the project. After that is done, which will be less than 30 minutes now, I can pause until tomorrow. But from the point of view of project planning, there are dependencies. Charles Matthews (talk) 16:49, 24 April 2019 (UTC)Reply
The system seems more or less back to normal. Thanks for considering! --Egon Willighagen (talk) 17:21, 24 April 2019 (UTC)Reply

ScienceSource

Hello Charles,

I have visited the ScienceSource Wikibase instance site. I can't create an account there. Can you please check whether there is a problem? -- Hogü-456 (talk) 18:46, 2 May 2019 (UTC)Reply

Hi Charles,

The website says it's down and will be back up in 2020. That time has passed. What's the status of this project? --Hunterboerner (talk) 17:34, 10 May 2021 (UTC)Reply

@Hunterboerner: After a spam attack in 2019, the site needed some developer attention. Because of the pandemic, that has not been easy.
I continue to work on the project, in pages at WD:SS. There has in fact been a large cleanup operation here on Wikidata. A major part of that is documented at Wikidata:ScienceSource project/Focus list, main subject MeSH errors. A large family of errors in main subject (P921) statements here has been tackled, and that page documents the current situation (for the focus list): the errors are now limited to the accurate naming of some cancer topics.
I have continued to work on Wikidata with the NCBI2wikidata tool, using some better techniques applying command line scripts.
Going back to 2019, the official end of the project was difficult, because the site needed an internal migration in its hosting, for more memory.
In 2019, I decided to concentrate on the MeSH descriptor ID (P486) aspects, and by December 2019 I had completed the mix'n'match catalogs for it. Some numbers for that are at Wikidata:ScienceSource project/MeSH and cleanup dashboard. I also worked to reduce the constraint violations for that property, which were very numerous because of bot postings.
To sum up, the ScienceSource pipeline is quite long, and since the official end of the project I have worked mainly on the middle parts of it, improving the quality of data here on Wikidata.
As for MEDRS, Wikidata:ScienceSource project/MEDRS report was the report in May 2019. I have put some further thoughts on the talk page there: it would be possible to revisit that side of the project.
Thank you for your interest. Charles Matthews (talk) 17:52, 10 May 2021 (UTC)Reply

Subclass or cause?

About this, do you think that "subclass" or "cause" would make more sense? WhatamIdoing (talk) 15:20, 27 August 2019 (UTC)Reply

@WhatamIdoing: Interesting point. I'm following https://www.ncbi.nlm.nih.gov/mesh/?term=Splenosis which has Splenosis subordinate to Splenic Rupture, subordinate to Splenic Diseases. So thinking of everything as a disease (Q12136) in the broad sense, i.e. just any abnormal condition, "subclass" tends to be used to describe a more specialised condition. But MeSH is perhaps likely to miss this sort of point sometimes: obviously splenosis is a kind of side-effect of the rupture. So has cause (P828) should be OK. Charles Matthews (talk) 15:32, 27 August 2019 (UTC)Reply
Maybe both would be the best approach? WhatamIdoing (talk) 20:47, 27 August 2019 (UTC)Reply
@WhatamIdoing: I've used facet of (P1269), which seems appropriate here. Charles Matthews (talk) 07:47, 4 September 2019 (UTC)Reply
That sounds like a reasonable approach. Thanks. WhatamIdoing (talk) 17:12, 4 September 2019 (UTC)Reply

Community Insights Survey

RMaung (WMF) 17:38, 10 September 2019 (UTC)Reply

Wrong data

Ukrainian Wikipedia (Q199698) has both official name (P1448) and title (P1476). Why? Please remove title (P1476), thanks!!! --2001:B07:6442:8903:C4AD:B849:2AF8:ED72 15:51, 20 September 2019 (UTC)Reply

Reminder: Community Insights Survey

RMaung (WMF) 19:54, 20 September 2019 (UTC)Reply

your batch edits

I don't think that adding "main topic"--->"enzyme" to highly specific articles helps anyone, not even an AI. I mean even copying the enzyme name from the title and finding the corresponding enzyme family by text comparison would be more valuable and could be easily implemented. --SCIdude (talk) 17:23, 30 September 2019 (UTC)Reply

@SCIdude: Thank you for the comment. Let me explain that project as a whole.
The NCBI2wikidata tool in use is a custom tool for adding disease and other primary metadata to article items. It only adds to items about articles that are reviews and which are under a Creative Commons license. These articles, for the ScienceSource project as originally conceived, were the key ones: we were interested only in such articles, and needed 30K of them, with some other side conditions.
I would say that tagging items as reviews with a CC license is anyway a very positive thing to do. I have brought the number of articles of that kind up to 60K recently, and continue to work in that direction. So I have run numerous MeSH searches on the PubMed API used by the tool, "Enzymes" being just one of these. These are documented at
Wikidata:ScienceSource project/NCBI2wikidata rsplus1
and in particular in the section Wikidata:ScienceSource project/NCBI2wikidata rsplus1#Runs to review (non-leaf). The current run labelled "enzyme" is at the bottom of that section.
"Non-leaf" means the search term used is not a leaf of the MeSH topic tree. The QuickStatements code for "rsplus1" has the search term added to it appears as a main subject (P921) statement, which is what you are seeing.
To take an article at random, Plant Ribosome-Inactivating Proteins: Progesses, Challenges and Biotechnological Applications (and a Few Digressions). (Q41918293), by going to the PubMed link on the item I can see that "Enzymes" appearing as "enzyme" in the main subject statement actually stands for the "Ribosome Inactivating Proteins" MeSH term; in other words ribosome-inactivating protein (Q24788543). I would do that as a conscious check through the "enzyme" additions, I would be able to make the main subject precise, and would be able to add another MeSH major term, "Plants". Further, as I have discovered just now, ribosome-inactivating protein (Q24788543) should carry a MeSH descriptor ID (P486) statement that isn't there just yet.
To sum up, there is a two-step workflow set up for improving these high-level terms by replacing them by accurate MeSH major terms.
In contrast, the other main subject there, "biotechnology", is unreferenced, though one can see it came from the title. It is a minor MeSH term if you look on PubMed. The tagging I'm doing is upmarket of that.
Charles Matthews (talk) 18:17, 30 September 2019 (UTC)Reply
Thanks for the explanation. Can you please give an example of an item after the full workflow? --SCIdude (talk) 19:04, 30 September 2019 (UTC)Reply
@SCIdude: OK, I have worked over MYO5B, STX3 and STXBP2 mutations reveal a common disease mechanism that unifies a subset of congenital diarrheal disorders: A mutation update. (Q47269140) chosen at random. It had been in a previous pass, and I see now that "enzyme" was added without reference and date. That's a glitch in this one batch: the previous addition of "genetic variation" has the usual dated reference. Justifying your query.
So https://www.wikidata.org/w/index.php?title=Q47269140&action=history shows I took about 20 minutes in this case. I did have to create two new items, for Myosin type 5 and for Qa-SNARE proteins. The substitutions for "genetic variation" (which I got wrong first time) and for "enzyme" were forced from the MeSH terms because the MeSH code strings have known initial strings. Charles Matthews (talk) 20:28, 30 September 2019 (UTC)Reply
It's a pity that you need to do this by hand. But even if we had a complete InterPro import, the items would have no MeSH link, so it can't be automated. Having a nearly complete mapping MeSH --> WD seems desirable. Do you know of any plans about such a project? --SCIdude (talk) 05:24, 1 October 2019 (UTC)Reply
@SCIdude: I am actually working on MeSH. Indeed, with a 1-to-1 match of MeSH into Wikidata, software can take over. I'm not a developer, but it seems to me that NCBI2wiki, written in golang, could read across the major MeSH terms from PubMed with some modifications. The question is how to get there.
So the issues with MeSH were initially quite complex: a large collection of database constraint violations for MeSH descriptor ID (P486), which I have got on top of, for the D-numbers, just recently. There were many MeSH IDs in the wrong places: there are still 100 on gene items that need to be on the corresponding proteins. New properties were created, so in particular the M-numbers now have their own property.
MeSH descriptor ID (P486) now occurs on 20K items, a recent milestone. I have done some systematic work on organic chemistry, and am currently using the MeSH Organisms catalog on mix'n'match where there is some low-hanging fruit.
So, yes, I have been active to that end for some weeks now. Completeness is a few months away. When it is a bit closer, I'll think more seriously about adapting the existing metadata tool. I see the goal of automation as within reach.
I could spend the rest of my life adding these main subjects by hand. That is not my intention. Charles Matthews (talk) 05:42, 1 October 2019 (UTC)Reply
Tools work only if items already exist so I might look into an InterPro update import. One of its problems is to make clear that InterPro domain entries actually are families that collect all proteins with that domain, so they should be instance-of protein family. Currently I'm finishing transport protein families from the TCDB (up to a certain depth), and as a survey will look today how many of them are covered by MeSH, maybe adding the most general IDs. So I think we agree on priorities. Best regards. --SCIdude (talk) 06:01, 1 October 2019 (UTC)Reply

New page for catalogues

Hi, I created a new page where I started collecting sites that could be added to Mix'n'match and I plan to expand it with the ones that already have scrapers by category. Feel free to use, expand. Best, Adam Harangozó (talk) 15:09, 19 October 2019 (UTC)Reply

problem with main subject additions

You are adding wrong main subject values, are you aware of this? E.g. Q21145003 has nothing to do with globulins or blood proteins, and the MeSH terms of its PMID are not the source of it. If I were you I'd stop the bot. --SCIdude (talk) 18:01, 8 March 2020 (UTC)Reply

@SCIdude: Thanks for the comment. I have not in fact used that metadata bot recently: I had done enough with that version, and I'm trying to learn enough of the programming language to modify it.
In this case, there was no actual mistake: the information from the PubMed API was correct. The "globulins" subject can be seen on https://www.ncbi.nlm.nih.gov/mesh/?term=Antitoxins, and "Antitoxins" is one of the two starred MeSH terms on the PubMed page https://pubmed.ncbi.nlm.nih.gov/19325885-bacterial-toxin-antitoxin-systems-more-than-selfish-entities/ for the paper. One of the diagrams on the Antitoxins page shows "Globulins" five levels up from "Antitoxins".
By the way, in December I was also completing the MeSH catalogs on mix'n'match. So all of those are now matched into Wikidata. But since the annual updates in 2018, 2019 and now 2020 have happened since the catalogs were uploaded, there are some hundreds of current MeSH terms still missing. It would be interesting to have those new MeSH terms on mix'n'match. Charles Matthews (talk) 19:10, 8 March 2020 (UTC)Reply
But matching the title with antitoxin is wrong; it should have matched with toxin-antitoxin system or pair Q3495384, which is a biological process in bacteria, while antitoxins are immunoglobulins in vertebrates. --SCIdude (talk) 19:58, 8 March 2020 (UTC)Reply
This leads me to the question, how are main subject values updated if the item where the MeSH id was placed changes? Will your bot update the main subjects on all article items that have the old subject? In the last months I have moved lots of MeSH ids to more correct items, and the case above would need also a move. What happens in the articles? --SCIdude (talk) 20:13, 8 March 2020 (UTC)Reply

(edit conflict)

@SCIdude: On the first point, you seem to be saying that the "Antitoxins" MeSH term assigned by PubMed is incorrect. In that case of course "globulins" should be removed as main subject. Not a software issue.
On the second point: there were numerous problems initially, of the same kind. I have a list of topics where I should check all cases. The current version of NCBI2wikidata cannot automate such a process. A more "generic" version could be more helpful, simply reading all the major MeSH terms across from the PubMed API.
For that to work smoothly, it was first necessary to complete the MeSH matching here (or have it almost done). So I did that task.
To have a really good system, there should be another advance in the bot, such as is available in principle in bots written in Rust. Good checking and cleanup could become available that way.
Charles Matthews (talk) 20:25, 8 March 2020 (UTC)Reply
In summary, yes, it's their problem. And on the updating, this is a widespread issue in WD. --SCIdude (talk) 05:27, 9 March 2020 (UTC)Reply

@SCIdude: By the way, you mentioned the issue of MeSH descriptor ID (P486) statements being moved to other items. This kind of query can find such cases:

#Check main subject statements referenced to PubMed for whether the topic has a MeSH ID
SELECT ?item ?topic
WHERE {
  ?reference pr:P248 wd:Q180686 .              # reference: stated in PubMed
  ?statement prov:wasDerivedFrom ?reference .  # main subject statement sourced to that reference
  ?item p:P921 ?statement .                    # item carrying the statement
  ?item wdt:P698 [ ] .                         # item has a PubMed ID
  ?statement ps:P921 ?topic .
  MINUS { ?topic wdt:P486 [ ] }                # but the topic has no MeSH descriptor ID
}
LIMIT 10

When I ran it just now, the first topic was preproinsulin (Q7240673) where you moved it; and the next nine hits were for breast cancer (Q128581), where I moved it to breast neoplasm (Q58833934) in June 2019. Once these topics are found in this way, in principle a bot can fix them. Charles Matthews (talk) 17:11, 9 March 2020 (UTC)Reply

Gerald Wellesley edit

Hi! I reverted the merge you made between Q90466544 and Gerald Wellesley (Q10504274) because you appeared to be merging a human with a disambiguation page. I would have attempted a human-to-human merge, but I cannot convince myself that the human you have identified is the same as any of the various Gerald Wellesleys we already know about. In particular, looking at the Cambridge Alumni page, is "1788" supposed to be a date of birth or a date of matriculation? Cheers, Bovlb (talk) 16:26, 14 April 2020 (UTC)Reply

@Bovlb: Q90466544 is actually for a disambiguation page, in a database. But I agree it was a bad merge. Q90466544 should be deleted. Charles Matthews (talk) 16:35, 14 April 2020 (UTC)Reply
Ah. It's gone. Bovlb (talk) 16:39, 14 April 2020 (UTC)Reply

Q19036877 edit

Be careful not to add duplicate main subject (P921) values to biographical article (Q19389637) items. I've had to undo some of your edits due to this. ミラP 19:28, 28 May 2020 (UTC)Reply

@Miraclepine: Apologies. It was an oversight, and I wasn't aware that you were working on P921. Charles Matthews (talk) 20:00, 28 May 2020 (UTC)Reply
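For the record, duplicated values of this kind can be surfaced with a query along these lines (a sketch, limited to biographical article (Q19389637) items; it may still need narrowing to finish within the query service time limit):

# Sketch: biographical article items with the same main subject stated more than once
SELECT ?article ?subject (COUNT(?statement) AS ?copies)
WHERE {
  ?article wdt:P31 wd:Q19389637 .  # instance of: biographical article
  ?article p:P921 ?statement .
  ?statement ps:P921 ?subject .
}
GROUP BY ?article ?subject
HAVING ( COUNT(?statement) > 1 )
LIMIT 100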

wrong main subject edit

Hi! This bot change seems totally wrong. Can you please have a look? --SCIdude (talk) 09:10, 14 September 2020 (UTC)Reply

@SCIdude: Yes, you are right. It is listed on Wikidata:ScienceSource project/Focus list, main subject MeSH errors, which is my page for systematic corrections; but it is not caused by MeSH. A wrong Q-number was used with a bot run. There is some virus topic that should be substituted. But I have a newer technique, now. Charles Matthews (talk) 09:21, 14 September 2020 (UTC)Reply
Very good! --SCIdude (talk) 09:24, 14 September 2020 (UTC)Reply

We sent you an e-mail edit

Hello Charles Matthews,

Really sorry for the inconvenience. This is a gentle note to request that you check your email. We sent you a message titled "The Community Insights survey is coming!". If you have questions, email surveys@wikimedia.org.

You can see my explanation here.

MediaWiki message delivery (talk) 18:45, 25 September 2020 (UTC)Reply

Solution edit

Maybe solution (Q5447188) subclass of (P279) pharmaceutical preparation (Q66089252) is true for the MeSH classification tree (which includes only medical terms), but such a statement does not seem to be true for solution (Q5447188) in general. Wostr (talk) 10:09, 10 October 2020 (UTC)Reply

@Wostr: OK, I have removed it. Charles Matthews (talk) 10:23, 10 October 2020 (UTC)Reply

another mesh subject issue edit

Q60939671 got Q426145, probably from the "heme" part of "heme oxygenase". Shouldn't they stop matching single words that are also part of multi-word terms? --SCIdude (talk) 08:22, 11 October 2020 (UTC)Reply

@SCIdude: Yes, in the sense that https://meshb.nlm.nih.gov/record/ui?ui=D006418 on the MeSH Tree Structures tab has "Heme" as a narrower term of "Polycyclic Compounds"; and "Heme" is given as a starred MeSH term on https://pubmed.ncbi.nlm.nih.gov/30583467/, as well as "Heme Oxygenase-1". The abstract mentions heme metabolism. Doesn't seem too bad to me. Charles Matthews (talk) 08:37, 11 October 2020 (UTC)Reply
It is completely off the mark IMO. --SCIdude (talk) 08:52, 11 October 2020 (UTC)Reply
OK, the broader terms are only placeholders. Charles Matthews (talk) 08:54, 11 October 2020 (UTC)Reply

[WMF Board of Trustees - Call for feedback: Community Board seats] Meetings with the Wikidata community edit

The Wikimedia Foundation Board of Trustees is organizing a call for feedback about community selection processes between February 1 and March 14. While the Wikimedia Foundation and the movement have grown about five times in the past ten years, the Board’s structure and processes have remained basically the same. As the Board is designed today, we have a problem of capacity, performance, and lack of representation of the movement’s diversity. Our current processes to select individual volunteer and affiliate seats have some limitations.

Direct elections tend to favor candidates from the leading language communities, regardless of how relevant their skills and experience might be in serving as a Board member, or contributing to the ability of the Board to perform its specific responsibilities. It is also a fact that the current processes have favored volunteers from North America and Western Europe. In the upcoming months, we need to renew three community seats and appoint three more community members in the new seats. This call for feedback is to see what processes we can all collaboratively design to promote and choose candidates who represent our movement and are prepared with the experience, skills, and insight to perform as trustees.

In this regard, two rounds of feedback meetings are being hosted to collect feedback from the Wikidata community. Both rounds have the same agenda, to accommodate people from various time zones across the globe. We will be discussing ideas proposed by the Board and the community to address the above-mentioned problems. Please sign up for whichever is most comfortable for you. You are welcome to participate in both as well!

Also, please share this with other volunteers who might be interested in this. Let me know if you have any questions. KCVelaga (WMF), 14:32, 21 February 2021 (UTC)Reply

New tool edit

I created a new tool to simplify adding main subjects to our datasets. Feel free to try it out 😃 Wikidata:Tools/ItemSubjector --So9q (talk) 11:03, 18 September 2021 (UTC)Reply

@So9q: I have much more respect for inferred from abstract (Q75484171) than I do for inferred from title (Q69652283). Charles Matthews (talk) 11:11, 18 September 2021 (UTC)Reply
Interesting. Can we reliably find those programmatically for some or all articles?--So9q (talk) 18:07, 18 September 2021 (UTC)Reply
Well, my point is that mechanical extraction of topics from titles is not a reliable method, and really needs the sanity check of someone reading the abstract, if that is available. Your method seems reasonable: but just saying "inferred from title" doesn't give me confidence. Better methods are use of keywords, when those translate into Q-numbers; or use of MeSH terms from PubMed. Charles Matthews (talk) 03:48, 19 September 2021 (UTC)Reply
For example, I see just now on Extracellular vesicle-derived DNA for performing EGFR genotyping of NSCLC patients. (Q49645159) that you have added the subject "non-small cell lung cancer" based on NSCLC in the title. There was already the subject "non-small cell lung carcinoma", which is more accurate, and is the MeSH term given. It turns out that "non-small cell lung cancer" is an index term (keyword) supplied by the authors, and is also in the text of the abstract. So I can't really object to it. But the heuristic mentioned is not a great one. Charles Matthews (talk) 05:29, 19 September 2021 (UTC)Reply
Thanks for taking the time to critique the crude method used here. I wholeheartedly agree with you, and wish authors and publications did the categorization themselves (and published it openly as linked data) so that we could just link to it. This is the least bad way of curation I have come up with so far :/--So9q (talk) 15:00, 19 September 2021 (UTC)Reply
@So9q: The points that come up are (i) the keywords are not taken from a restricted vocabulary (such as MeSH), (ii) they are often abbreviations and so ambiguous, and (iii) they are a good guide to cutting-edge research concerns, while MeSH is more conservative. There has been plenty of trouble so far with automation for P921 and incorrect disambiguation. Wikidata items are sometimes created specifically to be targets.
The keywords, I believe, are on a PubMed API. (I have not used it - the NCBI2wikidata tool I use imports MeSH terms from PubMed, and I consider those to be a good baseline.) The disambiguation issue is serious, but probably better use could be made of the keywords. Charles Matthews (talk) 05:49, 20 September 2021 (UTC)Reply
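As an illustration, statements carrying the weaker heuristic can be pulled out for review with something like the following (a sketch, assuming such references record based on heuristic (P887) with inferred from title (Q69652283); swapping in Q75484171 gives the abstract-based ones instead):

# Sketch: main subject statements referenced as "inferred from title", for human review
SELECT ?paper ?subject
WHERE {
  ?paper p:P921 ?statement .
  ?statement ps:P921 ?subject .
  ?statement prov:wasDerivedFrom ?reference .
  ?reference pr:P887 wd:Q69652283 .  # based on heuristic: inferred from title
}
LIMIT 100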
A new tool 🤩. Nice QA page you created there. Would it maybe benefit from being moved to the relevant Wikiproject working on science sources (WD:WikiProject Source)? I create subclass of items all the time when I work on adding main subjects. One example from today: charged particle therapy (Q108607000), does that look ok to you?--So9q (talk) 06:35, 20 September 2021 (UTC)Reply

So that is different from heavy ion radiotherapy (Q493976) and particle therapy (Q38276304) (for MeSH D063193)? Charles Matthews (talk) 07:11, 20 September 2021 (UTC)Reply

Is the "Alumni Oxonienses" QUEST query restricting to people living in the time covered by the article? edit

Hi, I just noticed your "Alumni Oxonienses" QUEST query. Nice use case! I had a look at how you restricted it to the time period covered by the article, but it is not doing that, right? Is that intentional? --Egon Willighagen (talk) 09:42, 4 August 2022 (UTC)Reply

@Egon Willighagen: Hello again! Well, I asked Magnus to set up the query for me, and he used my name to post it. So I don't yet understand exactly how it works. (And so far it seems not to be live? I don't know why.)
Anyway, this all arises from work on s:Alumni Oxonienses: the Members of the University of Oxford, 1715-1886, a long digitisation project started on English Wikisource in 2010, and completed a few weeks ago. In the direction of adding main subject (P921) statements to items here about the individual articles, there is the topicmatcher tool to use. Miraclepine (talkcontribslogs) has very nearly finished (maybe finished) creating the 63K items here for the work. There are nearly 7K main subjects now.
The QUEST query is intended to fill in the link in the other direction. There are quite a number of other significant works on Wikisource for which similar queries would be useful. So it would be good to fix the current teething problems.
On your point: I think you are suggesting a check in the SPARQL to look at the dates on items for the people who are the objects of the P921 statements. This is a good idea anyway. The date range starts at 1715, but not all entries are for people who were born by about 1695. Still, checking for those born earlier might be reasonable. Also, to have entered the University of Oxford by 1886 means usually you would have been born by 1870. So a date of death after 1960 should be worth checking.
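A date sanity check of that kind could look something like this (a sketch; it assumes the article items point at the work via published in (P1433), and takes Q19036877 from the thread heading above to be the item for the work - swap in the correct Q-number if that reading is wrong):

# Sketch: main subjects of the Alumni Oxonienses articles born after 1870
SELECT ?article ?person ?born
WHERE {
  ?article wdt:P1433 wd:Q19036877 .  # published in: the work (assumed Q-number)
  ?article wdt:P921 ?person .
  ?person wdt:P569 ?born .
  FILTER ( YEAR(?born) > 1870 )
}
ORDER BY DESC(?born)

The same pattern with date of death (P570) and a 1960 cut-off would catch the other kind of outlier.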
To do more - and perhaps this is what you meant - one could extract dates from the article text? Some actual text-mining. Charles Matthews (talk) 10:06, 4 August 2022 (UTC)Reply
Yeah, I would not suggest Oxford alumni who were born after 1870 as main subjects. Personally, I prefer to start by being overly safe, rather than having to clean up afterwards. Right now, it's a bit tricky to provide additional context, like the birth/death dates of the alumni in this case. @Magnus_Manske:, what do you think? --Egon Willighagen (talk) 12:31, 4 August 2022 (UTC)Reply

InterPro families edit

Hi Charles, I know this is not immediately clear but please note that InterPro families like Q24769481 are defined by InterPro from matching specific amino acid patterns. In contrast, an enzyme class like Q3664981 is defined to be all enzymes that have this activity (which potentially includes proteins with different aa sequences). These are different concepts, so I undid your merge. I may add "different from" statements to such pairs in the future. SCIdude (talk) 15:16, 11 August 2022 (UTC)Reply

@SCIdude: OK, thanks for the clarification. The whole business of protein families is difficult, from the point of view of MeSH descriptor ID (P486). I do what I can. Since there is sometimes a MeSH D-number concept of a protein, and then a MeSH C-number concept for the human protein, it seems clear that the MeSH D-number should mean a family, for some taxon range, unless the scope is defined clearly to be humans. Most of those protein families would not be in InterPro (and when they are, in a broad sense, your comment clarifies why the family may include some odd bacterial proteins, so the fit would not be exact). So it seems there may not be neat solutions, as things stand. Charles Matthews (talk) 16:28, 11 August 2022 (UTC)Reply
It is even more difficult, as InterPro frequently adapt their patterns, changing the set of proteins contained. SCIdude (talk) 16:33, 11 August 2022 (UTC)Reply

WBIS edit

Hi Charles, I've started having a go at WBIS on Mix'n'Match - focussing on the African Biographical Archive entries. I noticed you'd also spent time there! A couple of questions: 1. is there a Wikipedian path to access WBIS? 2. am I right that the dates mentioned for people in WBIS Mix'n'Match are often for life events (assumption of political positions etc.) which are neither their birth nor their death? Dsp13 (talk) 15:31, 12 November 2022 (UTC)Reply

@Dsp13: On point (2), correct: they may often be birth dates, but often not, so they are only vaguely indicative.
I hadn't thought about (1). The catalog is on mix'n'match because the databases are commonly used in German academia, I'm told. I don't think the Wikipedia Library have a deal with them.
I look things up all the time on mix'n'match, but (MeSH aside) I don't specialise in anything there these days. The ODNB coverage is probably deficient on Wikidata now. I noticed just recently that Jill Craigie (Q6192780) has had an ODNB article since July - I wonder what else. Charles Matthews (talk) 16:18, 12 November 2022 (UTC)Reply
Thanks for the reply. When I was in academia, I used WBIS (only really the British Biographical Archive) at the Wellcome Library. Earlier as a student I'd used it on microfiche in the UL. I found it a fantastic collection of published biographical entries on C18th and C19th people beneath the ODNB radar. I've never had institutional access to it which allowed me to use it remotely, though, and haven't used it at all in the last decade.
While links to WBIS are firewalled, actually providing the link to Wikidata is of little direct use. But it's still a useful redlink list. I'm mid-way through a preliminary pass through people provisionally matched from the African Biographical Archive. After that I plan to start adding new items. I haven't thought about ODNB in some time. Dsp13 (talk) 16:42, 12 November 2022 (UTC)Reply
@Dsp13: A recent development, Quest by Magnus, is discussed in the thread two above. It is now working for my example of Alumni Oxonienses (needed more memory for my Chromebook). See https://quest.toolforge.org/#/21 for that. Much more could be done with that approach, in digital humanities, complementary to https://topicmatcher.toolforge.org/#/wikisource. Blogpost by Magnus. The underlying SPARQL queries might be useful to you in some way. This tool should really be running for all the major reference works on Wikisource.
In other news, User:Magnus Manske/Author strings has created a very large crop of low-hanging fruit in the WikiCite area, and we should start getting better entity identification for scientific researchers. Charles Matthews (talk) 08:40, 13 November 2022 (UTC)Reply
Yes, I've installed the author strings gadget - which seems to exploit the citation graph nicely to get its candidates - but I get the impression that author strings are still being created faster than any mechanism to match them to authors! Thanks for the Quest heads-up - I'll try to get my head around it. Dsp13 (talk) 14:46, 13 November 2022 (UTC)Reply
@Dsp13: On the author strings and ORCIDs, there is an arcane point to do with the activities of LargeDatasetBot (a mixed blessing generally). It can be a help when an ORCID offers no personal information at all, which is common enough and frustrating. In that case "What links here" may offer a handful of links from items about papers. I believe these are not made at random by that bot. There is some JSON on Europe PMC that offers, for a paper, ORCIDs for some of the authors, and so is machine-readable and presumably reliable author-provided data.
So this route bypasses the obstruction to matching to author items with such ORCIDs, by indicating topical areas. In any case I'm now convinced. Charles Matthews (talk) 09:32, 15 November 2022 (UTC)Reply

Scientist items without any data edit

Hi Charles,

is there no further information for people like Djin-Ye Oh (Q115908326) or Hans-Rudolf Pfeifer (Q115891550) available?-- Trevor Bickford (talk) 10:43, 2 January 2023 (UTC)Reply

@Trevor Bickford: The first line of investigation is "What links here", i.e. https://www.wikidata.org/wiki/Special:WhatLinksHere/Q115908326 in the case of Djin-Ye Oh (Q115908326). The second of the papers listed there has a PMC ID on its item, and the PMC page on mouse-over of the author's name reveals "Microbiology and Hygiene, Charité University Medical Center, Humboldt University Berlin, Dorotheenstrasse 96, D-10117 Berlin, Germany." So that is one line of attack.
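The SPARQL equivalent of that backlink check is a query along these lines (a sketch):

# Sketch: papers crediting Djin-Ye Oh (Q115908326) as author, with PMC IDs where present
SELECT ?paper ?paperLabel ?pmcid
WHERE {
  ?paper wdt:P50 wd:Q115908326 .         # author: Djin-Ye Oh
  OPTIONAL { ?paper wdt:P932 ?pmcid . }  # PubMed Central ID, if recorded
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}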
Also, Europe PMC offers further metadata, at https://europepmc.org/article/MED/17925445. Searching there for author information, I get
Oh DY,0000-0003-0541-6155
which supplies a search and an ORCID ID. So I can get to https://orcid.org/0000-0003-0541-6155.
Because I have the author_strings script loaded, I get a list of 11 papers potentially by Djin-Ye Oh offered to me when I'm on the item. There are also suggested papers by others that may come from co-authorship. Using that script it is often possible to fill in authorship of many papers, working locally on the bipartite graph of articles and authors.
It would be quite a mouthful to try to explain the whole picture, and which backlinks are going to be trustworthy, but it is clearly not the case that such author items are necessarily dead ends. Charles Matthews (talk) 11:04, 2 January 2023 (UTC)Reply

Check edit

Is this right? https://www.wikidata.org/w/index.php?title=Q1872556&diff=prev&oldid=1825355817 WhatamIdoing (talk) 03:13, 5 February 2023 (UTC)Reply

@WhatamIdoing: Good question. The PubMed search https://pubmed.ncbi.nlm.nih.gov/?term=Low-Level+Light+Therapy+%5Bmesh%5D&sort=pubdate shows hits with photobiomodulation in the title, so I suppose photobiomodulation is an accepted description of LLLT. Which is also what the change in the lead at en:w:Low-level laser therapy two days ago would mean. Whether this is just some sort of verbal camouflage used for a fringe treatment I don't know. Charles Matthews (talk) 06:04, 5 February 2023 (UTC)Reply
That seems to be the eternal challenge with this subject. There might be something effective here, and it's certainly being over-sold by some (because that always happens...), but which things are the real ones, and which are hype? I suspect the answer might be "ask me in 10 years". WhatamIdoing (talk) 22:23, 5 February 2023 (UTC)Reply

Henning Bovenschulte (Q118397070) edit

Hello Charles Matthews!

Who is this person? No web links, no references. --HarryNº2 (talk) 15:19, 16 May 2023 (UTC)Reply

@HarryNº2: Thank you for bringing up this point. The edit summary mentions the "author strings" gadget. This is a script for helping to convert author name string (P2093) statements into author (P50) statements. This is a very large task, with over 100 million statements to deal with.
Anyway, to answer your question, the item Henning Bovenschulte (Q118397070) is linked to by an item for a scientific paper, and that item can be found using the "what links here" button. So Henning Bovenschulte (Q118397070) was linked to by Bronchogenic cyst mimicking ischemic heart disease (Q41830892). Since the item about the paper has a PMCID (P932) statement, I can follow that link to https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3519027/. As an author of the paper, Henning Bovenschulte is described as "Department of Radiology, University of Cologne, Cologne, Germany". Which will be where he worked around 2012. And I suppose he is the "Facharzt für Radiologie" shown on https://dreifaltigkeits-hospital.de/fachabteilungen/diagnostische-radiologie/team - some checking needed, but there is probably only one German doctor of that name. It is also plausible that he is the person on https://www.researchgate.net/profile/Henning-Bovenschulte. No ORCID ID, it seems.
To explain my working method, I am working down a list of items for papers that have virus (Q808) as a main subject (P921) statement. Among other tasks I try to convert all author name string (P2093) statements into author (P50) statements, for each paper. Not so easy. The "author strings" gadget works locally, on the bipartite graph of authors and papers. In order to convert author A, it is often necessary to go via authors B and C, say. I come across many authors without a Wikidata item - when the harder part of the job is to match correctly when there is some author of the same or similar name.
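The remaining backlog on that worklist can be approximated with a query along these lines (a sketch):

# Sketch: papers with main subject "virus" that still carry unconverted author name strings
SELECT ?paper (COUNT(?nameString) AS ?stringsLeft)
WHERE {
  ?paper wdt:P921 wd:Q808 .       # main subject: virus
  ?paper wdt:P2093 ?nameString .  # author recorded only as a name string
}
GROUP BY ?paper
ORDER BY DESC(?stringsLeft)
LIMIT 100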
What results is a combination of breadth-first and depth-first approaches. I think this is suitable for opening up a very large area of work (maybe 10m authors). Arguably this is the largest single Wikimedia project ever.
So I have explained who this person is, anyway. Charles Matthews (talk) 16:07, 16 May 2023 (UTC)Reply
A source reference for the data set must at least be available, c.f. described at URL (P973). --HarryNº2 (talk) 16:16, 16 May 2023 (UTC)Reply
Wikidata:Notability: "2. It refers to an instance of a clearly identifiable conceptual or material entity that can be described using serious and publicly available references." The item passes. Charles Matthews (talk) 16:25, 16 May 2023 (UTC)Reply

UPPERCASE edit

Hi Charles,

some of your creations look very ugly with their uppercase names in all languages beyond English. Example:

Could this be fixed?-- U. M. Owen (talk) 22:19, 30 July 2023 (UTC)Reply

@U. M. Owen: This is all automation. The creations are from the mix'n'match tool, from database entries - owner Magnus Manske. In the second case there was propagation by Pibot, owner Mike Peel. Charles Matthews (talk) 05:23, 31 July 2023 (UTC)Reply

DNB edit

Hello, I'm Eihel. I wanted to let you know that one or more of your recent description edits to Q15987216 didn't meet the Wikidata description guidelines. Descriptions should appear as though they were in the middle of a sentence, typically start with a lowercase letter, and be written from a neutral point of view. For example, "pop singer" would be a better description than "He is the best pop singer." If you think I made a mistake, or if you have any questions, you can leave me a message on my talk page. Thanks!  Eihel (talk) 01:33, 14 December 2023 (UTC)Reply

Likely mistaken merge edit

Charles, on what basis did you merge Stefan König (Q116262046) and Stefan König (Q114744273)? From what I can see one of them is associated with the ATLAS experiment, and the other with CMS; they have different affiliations and external ids. Did you track down some source that confirms they are the same person? ArthurPSmith (talk) 19:51, 7 January 2024 (UTC)Reply

@ArthurPSmith: Thanks for the correction.
I was using the Distributed Game "Duplicate Authors", created for me by Magnus Manske a few weeks ago. The game itself does not explain the principle of selection. Fundamentally it uses the collaboration graph: authors A and B are suggested if there is an item C for a human, with P50 statements saying that A and C collaborated on a paper and that B and C collaborated on a paper. There is filtering by matching the labels of A and B, on surname and initials.
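Expressed as a query, the principle is roughly the following (a sketch only, not the game's actual code; unrestricted it would time out on the query service and would need to be narrowed, for instance to one surname):

# Sketch: pairs of distinct author items sharing a co-author and an identical English label
SELECT ?a ?b ?c
WHERE {
  ?paper1 wdt:P50 ?a , ?c .
  ?paper2 wdt:P50 ?b , ?c .
  ?a rdfs:label ?name .
  ?b rdfs:label ?name .
  FILTER ( LANG(?name) = "en" )
  FILTER ( ?a != ?b && ?a != ?c && ?b != ?c )
}
LIMIT 20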
The way I proceed is to look at what links to the items A and B, and judging the topics they work on by the titles of paper items linking. In cases of a probable match that is not clear, I will go to the items about papers and look at affiliations.
So I note that with physics, the papers with very large numbers of authors can throw up negatives. In general the game produces about 40% good matches with its suggestions.
I did an initial run with the game and made 1300 merges. I'm now aware of three cases where the merge was bad. Currently, I'm not using the game because I'm working on the annual update on MeSH descriptor ID (P486).
So, I'm less familiar with physics topics than with life sciences, which give most of the work. The typical trouble is with "common names", and I should take more care there.
In the bigger picture, use of the game picks up many cases with A = B, caused by faulty bot runs adding an author twice to a paper item. Looking at the backlinks has also picked up numerous bot errors from about four years ago that used faulty identification of authors by poor heuristics. So there is cleaning up going on there.
It is hard to know how many merges of author items are needed, but it could easily be 100K. Many bot runs simply add information about authors without any attention to the duplication issue. That is true for the creation of items from Wikipedias, and there is a whole lot of work needed for those.
I'm also trying to find sparse, possibly duplicative person items: I'm developing a page at Wikidata:ScienceSource project/Sparse items. I should apologise for any incorrect merges made, but de-duplicating authors here seems to me to be a major issue that is neglected. Charles Matthews (talk) 07:04, 8 January 2024 (UTC)Reply
@Charles Matthews: Are you ok if I unmerge and disambiguate these two items then? It sounds like you should definitely be more careful with the high energy physicists - just having common co-authors definitely does not mean they are the same person, even with the same name. Affiliations need to be carefully examined in those cases! ArthurPSmith (talk) 20:55, 8 January 2024 (UTC)Reply
@ArthurPSmith: Certainly - go ahead and correct anything that's a problem. It would be easy for me just to skip physics people. I am interested in progress in that area, though, because I use the author_strings script for matching, and it is not possible to decouple the life sciences and physics there. The occurrence of many physics papers in it may be due to some wildcard matching, as well as to the way it works locally on the author-paper bipartite graph. In any case it often loads very slowly, because of the very large numbers of physics papers, typically with authors given by initials only, and the presence of those has a big impact on its usability. Charles Matthews (talk) 05:22, 9 January 2024 (UTC)Reply
I've told Magnus before to just ignore any papers with over 1000 authors in his stuff, the automatic matching is really not so useful there and we have other tools that can be used. But maybe that's not easy to do. At some point he was auto-matching people just based on surnames, completely ignoring the initials and I've had to correct probably thousands of errors that caused. ArthurPSmith (talk) 18:56, 10 January 2024 (UTC)Reply
@ArthurPSmith: Well, yes, if you go back to Source MD some years ago, I have recently been fixing errors that come from there, around 2019. I'm aware of Magnus's personal situation (I attended his marriage in lockdown Athens), and he has a remarkable overhead of software maintenance on his plate. Charles Matthews (talk) 19:05, 10 January 2024 (UTC)Reply

copyright license (P275): Public Domain Mark 1.0 Universal (Q7257361) is not correct edit

Hi Charles Matthews, I see a lot of edits like this one. The Public Domain Mark 1.0 Universal (Q7257361) is not a copyright license (P275) so these statements are not correct. Also the 4 (!) references you added don't seem to support your statement. Multichill (talk) 17:59, 9 March 2024 (UTC)Reply

@Multichill: The example you give seems quite complicated: what is said on https://europepmc.org/article/MED/22538346 is, firstly, "Articles in the Open Access Subset are available under a Creative Commons license. This means they are free to read, and that reuse is permitted under certain circumstances. There are six different Creative Commons licenses available, see the copyright license for this article to understand what type of reuse is permitted." That is the text behind the symbol near the top of the page. Secondly, down the page under a link "Copyright and License information", it says "Publication of EHP lies in the public domain and is therefore without copyright. All text from EHP may be reprinted freely. Use of materials published in EHP should be acknowledged (for example, 'Reproduced with permission from Environmental Health Perspectives'); pertinent reference information should be provided for the article from which the material was reproduced.[...]". This is confusing.
I don't think the information added to the article item comes directly from that page. I know it comes via the bot NCBI2wikidata, which mainly draws data from an API on PubMed, but also looks at a file that is a Europe PMC dump. That dump was (in 2019, when the bot was written) associated with the Europe PMC Open Access Subset. It was being used because, for some CC licenses, it is more accurate on the version number than other sources.
What happens is that the output of NCBI2wikidata is QuickStatements code, which I then run in QuickStatements. It is possible to modify it in various ways before running it. For example, if the issue here is that Public Domain Mark 1.0 Universal (Q7257361) is a wrong translation of a license, then the replacement can be made in the QuickStatements code.
The issue you raise looks like a bug fix, and so I'd have to look more closely at what is going on before commenting more. There is some parallel output from NCBI2wikidata for each run, which is useful for diagnostics, and I could check that.
In this case it really isn't so clear. From "Publication of EHP lies in the public domain and is therefore without copyright" it would follow that the license would be CC0. Then sentence 3 suggests they actually want CC-BY. Some translation by Europe PMC may have gone on. Charles Matthews (talk) 19:50, 9 March 2024 (UTC)Reply
Well, it turned out to be simple to run NCBI2wikidata (on "Polonium"), and the diagnostics show that Europe PMC classifies the paper with https://creativecommons.org/publicdomain/mark/1.0/. So it looks like the answer may be that this is a bug, and copyright status (P6216) should replace copyright license (P275). Charles Matthews (talk) 20:16, 9 March 2024 (UTC)Reply
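The affected items can be listed for cleanup with a query along these lines (a sketch, restricted to scholarly article (Q13442814) items):

# Sketch: articles recording Public Domain Mark 1.0 as a copyright license rather than a status
SELECT ?article
WHERE {
  ?article wdt:P31 wd:Q13442814 .  # instance of: scholarly article
  ?article wdt:P275 wd:Q7257361 .  # copyright license: Public Domain Mark 1.0 Universal
}
LIMIT 200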