Wikidata:Property proposal/MeSH Descriptor

MeSH Descriptor

   Withdrawn
Description: MeSH Descriptor
Data type: String
Allowed values: Alphanumeric, with some commas
Example 1: pancreatic cancer (Q212961) -> Pancreatic Neoplasms
Example 2: thiamine deficiency (Q18971622) -> Thiamine Deficiency
Example 3: bubonic plague (Q217519) -> Plague
Source: https://www.nlm.nih.gov/mesh/meshhome.html
Planned use: SPARQL queries pulling out these descriptors, to create search term lists and files
Robot and gadget jobs: A high proportion of the descriptors could be added now mechanically from the 16 MeSH-related mix'n'match catalogs listed here.

Motivation

The descriptor strings in the Medical Subject Headings (Q199897) system are auxiliary, in the authority control perspective, to the MeSH descriptor ID (P486); but in terms of MeSH as a controlled vocabulary, they are really the whole point. The descriptors are the topical building blocks of MEDLINE (Q1540899) searches, very widely used to search the major PubMed (Q180686) repository, and aliases cannot be used. I have come across them as central to the project I'm currently working on.

The efforts to match the MeSH-related mix'n'match catalogs into Wikidata mean that the majority of the descriptors could be entered as statements by bot here. In contrast, https://www.nlm.nih.gov/mesh/meshhome.html is well-defended against scraping (there is a dump available in its FTP area), creating a barrier to casual data mining of them there, and we can take away the need to do that. Greater control on Wikidata of MeSH descriptors can lead to improvements in the metadata held here for medical papers, so this property can contribute to WikiCite work. Charles Matthews (talk) 15:15, 20 February 2019 (UTC)[reply]

I now withdraw this proposal. I have been using subject named as (P1810) as a qualifier on MeSH descriptor ID (P486) statements, as helpfully suggested by @Jneubert: below. It seems OK, and it would be possible to require such a qualifier as a constraint to P486. That seems better to me now, since there isn't a clear consensus after four months. Charles Matthews (talk) 16:24, 28 June 2019 (UTC)[reply]
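For anyone wanting to try the qualifier approach described above, here is a minimal sketch of how the corresponding query text might be built. It assumes the standard statement and qualifier prefixes (p:, ps:, pq:) predefined at query.wikidata.org; the function name and its missing_only flag are hypothetical, introduced only for illustration. The second mode lists P486 statements that lack a P1810 qualifier, which is roughly what a "required qualifier" constraint would report.

```python
def build_p486_query(missing_only: bool = False) -> str:
    """Build SPARQL text for MeSH descriptor ID (P486) statements.

    missing_only=False: return item, MeSH ID and the "subject named as"
    (P1810) qualifier string.
    missing_only=True: return statements that have no P1810 qualifier.
    Illustrative helper only; not part of any existing tool.
    """
    if missing_only:
        select = "?item ?meshId"
        qualifier = "  FILTER NOT EXISTS { ?statement pq:P1810 ?name . }"
    else:
        select = "?item ?meshId ?descriptor"
        qualifier = "  ?statement pq:P1810 ?descriptor ."
    return "\n".join([
        f"SELECT {select} WHERE {{",
        "  ?item p:P486 ?statement .",
        "  ?statement ps:P486 ?meshId .",
        qualifier,
        "}",
    ])


if __name__ == "__main__":
    # Paste the printed text into the query service to run it.
    print(build_p486_query())
    print()
    print(build_p486_query(missing_only=True))
```

The sketch only builds the query text; running it against the live endpoint, and any LIMIT needed to stay within timeouts, is left to the reader.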

Discussion

Good point. If so, there is trouble already with what was done at mix'n'match. That is normally assumed to be fair use, in that small parts only are taken from webpages for the sake of identification. But judging by the link on https://id.nlm.nih.gov/mesh/ inviting one to download all the descriptors via SPARQL, I hope that one can reasonably assume this is not a sensitive issue. Charles Matthews (talk) 19:16, 20 February 2019 (UTC)[reply]
I met someone recently appointed at the National Library of Medicine who contributes to policy around MeSH. I invited that person to speak up here if they have any comments, or to write to me. If they respond only to me then I will relay their comment here. Blue Rasberry (talk) 21:07, 25 February 2019 (UTC)
Thanks. There are complexities around MeSH, but further points concerning this one are (a) the existence of an open NCBI API allowing the download (a bit slow) of the MeSH term/descriptor matching; and (b) the MeSH SPARQL endpoint, which (as Magnus pointed out to me) would allow the descriptors to be freely used with Wikidata information, if it were federated with query.wikidata.org. Everything I know about this points to concerns about bots hammering webpages to the detriment of other users, implying the use of technical limitations, rather than IP issues. Charles Matthews (talk) 08:17, 26 February 2019 (UTC)[reply]
  •   Oppose To me this looks like dumping purely auxiliary data into Wikidata. A CSV table mapping MeSH IDs to MeSH descriptors would help more. This CSV could also be put on Commons. -- JakobVoss (talk) 10:25, 26 February 2019 (UTC)
I find several reasons to disagree with the comment. The MeSH system is not a stable set of identifications: it is updated every year. Files that are created without allowing for maintenance, in the way suggested, become obsolete, something we see with BEACON files in the GLAM context. I have argued the "auxiliary" point above (it means "helpful", of course). And full text search exists here (e.g. https://www.wikidata.org/w/index.php?search=%22Infections%22&title=Special:Search&profile=advanced&fulltext=1), so that where the MeSH term is rather different from the label (happens commonly), the text string will show up in other searches. Charles Matthews (talk) 21:14, 26 February 2019 (UTC)[reply]
Updates need to be checked against the official MeSH data: download a data dump, extract headings (and possible entry terms as aliases for search expansion), that's it. If I understand your proposal you want to inject this concordance from MeSH ids to MeSH headings into Wikidata just because it is easier to have all data in Wikidata? -- JakobVoss (talk) 08:21, 27 February 2019 (UTC)[reply]
My premise is actually that SPARQL is now commonly used in the Wikidata community. In terms of SPARQL queries at query.wikidata.org, a file giving the matching of Wikidata items with the MeSH strings would be easily created and downloaded in its current form. (One should note that the single-value database constraint for MeSH descriptor ID (P486) is heavily violated here, see Wikidata:Database reports/Constraint violations/P486#Single value – one more MeSH complexity.) It happens that the ScienceSource project on which I'm working did download the dump about six months ago, for work on MeSH tree code (P672). I didn't do that: it requires developer skills, and saying "that's it" rather assumes those skills. Having the data in Wikidata makes it easier to export it from Wikidata, in whatever is the required form. I'm working with MeSH-based searches of a PubMed API right now, and their power is impressive. I believe I'm simply following the basic Wikimedia logic, that lowering barriers to reuse of information is good. Charles Matthews (talk) 20:38, 27 February 2019 (UTC)
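To make the "file giving the matching" concrete, here is a rough sketch of turning a query result into a CSV file and of spotting items that break the single-value constraint. It assumes the result arrives in the standard SPARQL JSON results layout with variables named item, meshId and descriptor; those variable names and all function names are illustrative. The second MeSH ID in the sample is a placeholder, not a real lookup.

```python
import csv
import io
from collections import defaultdict

# Sample shaped like a SPARQL JSON result. "D999999" is a placeholder ID
# used only to show an item carrying two P486 values.
SAMPLE = {"results": {"bindings": [
    {"item": {"value": "http://www.wikidata.org/entity/Q86"},
     "meshId": {"value": "D006261"},
     "descriptor": {"value": "Headache"}},
    {"item": {"value": "http://www.wikidata.org/entity/Q86"},
     "meshId": {"value": "D999999"},
     "descriptor": {"value": "Headache Disorders"}},
]}}


def rows_from_sparql_json(result):
    """Flatten bindings into (Q-number, MeSH ID, descriptor string) tuples."""
    rows = []
    for binding in result["results"]["bindings"]:
        qid = binding["item"]["value"].rsplit("/", 1)[-1]
        rows.append((qid, binding["meshId"]["value"],
                     binding["descriptor"]["value"]))
    return rows


def to_csv(rows):
    """Render the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["item", "mesh_id", "descriptor"])
    writer.writerows(rows)
    return buffer.getvalue()


def multi_valued(rows):
    """Return items with more than one distinct MeSH ID."""
    ids = defaultdict(list)
    for qid, mesh_id, _ in rows:
        if mesh_id not in ids[qid]:
            ids[qid].append(mesh_id)
    return {qid: found for qid, found in ids.items() if len(found) > 1}


if __name__ == "__main__":
    sample_rows = rows_from_sparql_json(SAMPLE)
    print(to_csv(sample_rows))
    print(multi_valued(sample_rows))
```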
Not as shown on this page, Use of MeSH in Online Retrieval Charles Matthews (talk) 20:38, 27 February 2019 (UTC)
Looks like they are involved with a restrictive license. Charles Matthews (talk) 20:38, 27 February 2019 (UTC)

@JakobVoss: @Jneubert: I'm not someone who believes in parsimony for properties, as a good thing in itself. But let's look at the subject named as (P1810) suggestion on an example, headache (Q86). Certainly, putting in the names (as I have just done) clarifies why the single-value constraint is being violated: "headache" and "headache disorders". There is another qualifier in play, mapping relation type (P4390), which is a sort of precedent here, but if references are being added, can be confusing. The MeSH page https://meshb.nlm.nih.gov/record/ui?ui=D006261 actually suggests "headache" should take precedence over "headache disorders" in diagnosis. For MeSH tree code (P672), the tree number C10.597.617.470 is not actually present currently on https://meshb.nlm.nih.gov/record/ui?ui=D006261, though its link rather suggests it may have been in the past.

It is quite troublesome, might induce a headache?

I'm not saying the suggestion is wrong, but as an intervention in some data modelling that already exists, it requires due consideration. The queries to extract the descriptor would be a bit more complex, but I'm not saying that this is a fundamental reason. Charles Matthews (talk) 10:09, 6 March 2019 (UTC)[reply]

Replies to these points. "MESH is available as RDF dump, so there's absolutely no need to scrape it from somewhere." I just disagree with that, and have made the argument above in terms of lowering the barriers to usage. MeSH is important enough for search, as used by medical people with little technical background, that I believe the argument is valid. w:MEDLINE#Retrieval describes in general what "MeSH terms", i.e. these strings, do for you, and w:Template:Reliable sources for medical articles seems to use a synonym workaround because the "MeSH term" may not actually be the enWP page name, though in many cases it will be.
For MeSH descriptor ID (P486), I think we should not be using the M-prefix terms, and in fact the C-prefix ones are not very useful in terms of search, though perhaps they should not be excluded. The property proposal showed only D-prefix terms, and the regex could have been set up that way. So I can agree with your suggestion to divide off the M numbers from the rest.
I agree about the need to "clean up the existing mess", and in fact the metadata tool I'm working with now produces long lists of duplications of D-numbers. Each case involves a careful look at the scopes. But why would the text property obstruct that work? Charles Matthews (talk) 14:12, 17 March 2019 (UTC)[reply]
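On the regex point above, a small sketch of how the prefixes could be told apart mechanically. It assumes, going only by the identifiers quoted in this discussion (D000082, M0000115, T000213), that a MeSH identifier is one capital letter followed by digits; the real P486 format constraint may differ, and the names below are illustrative.

```python
import re
from typing import Optional

# Assumed shape: one prefix letter followed by digits. This is inferred from
# the examples quoted on this page, not taken from an NLM specification.
MESH_UI = re.compile(r"^([DCMT])\d+$")
DESCRIPTOR_ONLY = re.compile(r"^D\d+$")

KINDS = {
    "D": "descriptor",
    "C": "supplementary record",
    "M": "concept",
    "T": "term",
}


def classify(ui: str) -> Optional[str]:
    """Return the record kind for a MeSH identifier, or None if unrecognised."""
    match = MESH_UI.match(ui)
    return KINDS[match.group(1)] if match else None


def is_descriptor_id(ui: str) -> bool:
    """True only for D-prefix identifiers, as a D-only constraint would allow."""
    return DESCRIPTOR_ONLY.match(ui) is not None


if __name__ == "__main__":
    for ui in ("D000082", "M0000115", "T000213"):
        print(ui, classify(ui), is_descriptor_id(ui))
```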
      • @Charles Matthews: "there's absolutely no need to scrape it from somewhere" means that we should import MESH RDF to WD (after proper transformation), no need to scrape it from websites.
"why would the text property obstruct that work?". MESH has excellent ids for descriptors (D), chemicals (C), and finally concepts (M) that carry single labels. The purpose of WD external properties is to link to databases: IDs can fulfill that purpose but strings cannot. So what is the purpose of the string you are proposing? Are your proposed strings the same as or different from the (M) concepts? Can you give examples? --Vladimir Alexiev (talk) 15:43, 23 March 2019 (UTC)
@Vladimir Alexiev: Perhaps I should clarify some things first. I am proposing a string property. As a matter of process I proposed it in this "authority control" section because it seemed to be the sensible place to have this discussion about MeSH. I am not proposing an external link property, of course. Normally, here, "authority control"="identifier in a database", to be added to a formatter URL. That is quite true. But some definitions might help.
What MeSH is, is not primarily a database identifier. w:Medical Subject Headings states that "Medical Subject Headings (MeSH) is a comprehensive controlled vocabulary for the purpose of indexing journal articles and books in the life sciences; it serves as a thesaurus that facilitates searching." I am proposing that we make this controlled vocabulary available on Wikidata, as a set of strings. Then, via SPARQL, it can be used for example to "facilitate searching". The association of strings with Wikidata items is needed for that. To go via an identifier at all is "two sides of a triangle".
To quote a definition from w:Authority control, "In library science, authority control is a process that organizes bibliographic information, for example in library catalogs by using a single, distinct spelling of a name (heading) or a numeric identifier for each topic." So a controlled vocabulary can be doing "authority control" by attaching strings to headings, to remove aliasing. Charles Matthews (talk) 21:32, 23 March 2019 (UTC)[reply]
@Charles Matthews: Are you proposing some string that's different from the labels of MESH Concepts (M codes)? If different, what is it? If same, why would you want a string rather than a controlled ID? --Vladimir Alexiev (talk) 15:25, 25 March 2019 (UTC)
@Vladimir Alexiev: Let's refer to MeSH Record Types. There "Descriptors" are divided into four classes. My main interest is in Class 1 Descriptors - Main Headings because they "are used to index citations in NLM's MEDLINE database, for cataloging of publications, and other databases, and are searchable in PubMed..." Then Class 2 Descriptors - Publication Characteristics (Publication Types) are also very useful for metadata work. I believe you are referring to the Supplementary Records that are "also called Supplementary Chemical Records(SCRs), [...] used to index chemicals, drugs, and other concepts". For a Descriptor entry such as https://meshb.nlm.nih.gov/record/ui?ui=D000082, there is a Concepts tab that leads to links such as Acetaminophen Preferred with Concept UI M0000115. I don't want the string "Acetaminophen Preferred", I want the Descriptor string "Acetaminophen". Which of those is your label?
This is all rather laborious, but we do seem to be talking past each other here.
Why would I want the string "Acetaminophen"? Because it is "searchable in PubMed", for one good reason.
And why am I making this proposal now? You are active on Github, I believe and the project I'm working on uses the tool https://github.com/ContentMine/NCBI2wikidata. Its README mentions the auxiliary tool GenerateMeshTerms, and this actually starts off the whole toolchain for the ScienceSource project. GenerateMeshTerms calls an API that is rate-limited to about two hits per second, to translate the MeSH descriptor ID (P486) into the string I'm talking about; which is then used to search PubMed. There wouldn't be an API if this were a useless idea, and it wouldn't be so severely rate-limited if there weren't many people wanting to do that translation. We can have the controlled vocabulary here, and avoid an unnecessary double translation Q-number -> MeSH ID -> MeSH term. Charles Matthews (talk) 16:11, 25 March 2019 (UTC)[reply]
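As an illustration of what a limit of about two hits per second means for the ID-to-term step, here is a minimal throttling sketch. It is not the GenerateMeshTerms code: the lookup callable is a stand-in for whatever API call does the translation, and the 0.5 second interval is simply the reciprocal of the quoted rate.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def throttled_lookup(
    mesh_ids: Iterable[str],
    lookup: Callable[[str], str],
    min_interval: float = 0.5,  # about two calls per second
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Tuple[str, str]]:
    """Yield (MeSH ID, term) pairs, spacing the calls at least min_interval apart.

    `lookup` is a hypothetical stand-in for the real ID-to-term API call.
    `clock` and `sleep` are injectable so the pacing can be checked without
    actually waiting.
    """
    last_call = None
    for mesh_id in mesh_ids:
        if last_call is not None:
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        yield mesh_id, lookup(mesh_id)


if __name__ == "__main__":
    # Stub table in place of the remote service; a single ID never sleeps.
    stub = {"D000082": "Acetaminophen"}
    print(list(throttled_lookup(["D000082"], stub.__getitem__)))
```

At this rate a list of 10,000 IDs takes well over an hour, which is the practical cost of the double translation.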
@Charles Matthews: Now take a look at the LOD representation produced by NLM: http://id.nlm.nih.gov/mesh/T000213 is the term you want, which contributes to http://id.nlm.nih.gov/mesh/M0000115, which contributes to http://id.nlm.nih.gov/mesh/D000082. So in this case you want T000213 Acetaminophen.
"I don't want the string 'Acetaminophen Preferred', I want the Descriptor string 'Acetaminophen'": hope you're not pulling my leg here. Of course, the label is "Acetaminophen", whereas "Preferred" is its status (T000213 is the preferred term for concept M0000115, which itself is preferred for descriptor D000082). The two are kept in different fields.
"Why would I want the string 'Acetaminophen'": well then, ask someone to write you a SPARQL query to extract all preferred terms of preferred concepts of descriptors. (Ontotext could do such semantic integration or querying work for you).
"GenerateMeshTerms calls an API that is rate-limited to about two hits per second. There wouldn't be an API if this were a useless idea, and it wouldn't be so severely rate-limited if there weren't many people wanting to do that translation." I never said it was useless data. I said taking just the string when there are perfectly good IDs at all levels of the MESH data hierarchy (descriptor D&C, concept M, term T) is folly. There's absolutely no reason to record just the string in WD, losing the structure that the good people at NLM have carefully built. They provide an RDF dump, so it'd be easy for us to transform it into any other desired form, and in processing this RDF, we aren't going to hit any rate limits.
I wasn't only talking of Chemical (SCR) descriptors. MeSH descriptor ID (P486) caters for all classes of descriptors. The problem is that there's only one WD property when in fact MESH has 3 levels of data (D&C, M, T), and the links that P486 generates for M and T are wrong. Should I propose two more props "MESH Concept" and "MESH Term"? --Vladimir Alexiev (talk) 08:06, 26 March 2019 (UTC)
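The three levels described above can be pictured with a toy in-memory model. This is only a sketch: the class and field names are invented for illustration and are not the vocabulary of the NLM RDF, and the only data is the Acetaminophen example given in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    ui: str
    label: str
    concept_ui: str  # the concept this term contributes to


@dataclass(frozen=True)
class Concept:
    ui: str
    preferred_term_ui: str
    descriptor_ui: str  # the descriptor this concept contributes to


@dataclass(frozen=True)
class Descriptor:
    ui: str
    preferred_concept_ui: str


# Only the example from the discussion; a real load would come from the dump.
TERMS = {"T000213": Term("T000213", "Acetaminophen", "M0000115")}
CONCEPTS = {"M0000115": Concept("M0000115", "T000213", "D000082")}
DESCRIPTORS = {"D000082": Descriptor("D000082", "M0000115")}


def heading(descriptor_ui: str) -> str:
    """Label of the preferred term of the preferred concept of a descriptor."""
    concept = CONCEPTS[DESCRIPTORS[descriptor_ui].preferred_concept_ui]
    return TERMS[concept.preferred_term_ui].label


def lineage(term_ui: str) -> tuple:
    """Walk upwards from a term: (term, concept, descriptor) identifiers."""
    term = TERMS[term_ui]
    concept = CONCEPTS[term.concept_ui]
    return (term.ui, concept.ui, concept.descriptor_ui)


if __name__ == "__main__":
    print(heading("D000082"))
    print(lineage("T000213"))
```

The string under discussion is what heading() returns, while lineage() shows that a term ID carries its concept and descriptor with it.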
@Vladimir Alexiev: "Folly". No, it's not folly, it is for project work I am engaged in, and in particular to open it up to have a wider, easier scope, by having more data here on Wikidata. "...absolutely no reason..." Sorry, that sounds to me like you think you have a veto here, and "easy for us" is not an inclusive way of putting it. (Why do you say "lose the structure"? ScienceSource added 11K MeSH Code identifiers here, last year, and I'm well aware of the tree structure. But it is not the only useful thing about MeSH.) "MeSH descriptor ID (P486) caters for all classes of descriptors." Yes, you have made your point here. But (a) it is off-topic in this actual discussion, and (b) I have been agreeing with you: because the "all classes" interpretation depends on the regex constraint, which can and I think should be changed for clarity. Why not have more MeSH properties? It is an important, complex system. I'd support a MeSH Concept property proposal.
I'm not sure what to make of your mention of w:Ontotext. Charles Matthews (talk) 08:27, 26 March 2019 (UTC)
Please see Wikidata:Property proposal/MESH Concept ID and Wikidata:Property proposal/MESH Term ID and vote for those proposals.
@Charles Matthews: To explain my position again: MESH is a 3-level system. Your proposal would effectively skip these levels and jump directly to the string. But Concepts and Terms have other useful characteristics besides the string. If you have a link to Tnnnn, you can pick up its level mechanically but not the other way around. "Easy for us" means for anyone who can handle data (such as the company I work for, but many other people as well). Yes, I'd like to have more MESH properties in WD, but we should have the "right" properties. WD links to over 2000 databases: not to the names of items in those databases, but to ids: that is for a good reason. --Vladimir Alexiev (talk) 12:41, 31 March 2019 (UTC)
So you have explained your position again: seems we have a loop here. What you have not done, really, is to address the "planned use" in the proposal. The "names" are not just some kind of label: they are the actual strings of the controlled vocabulary, needed to construct search terms. They are not at all arbitrary, as Wikidata labels in effect are. "If you can search using MeSH entry terms instead of keyword searching you can focus your search and find more relevant citations."[1] Charles Matthews (talk) 16:34, 31 March 2019 (UTC)[reply]
Further development: Vladimir has changed [2] the English label of P486 to "MeSH Descriptor ID". That makes the discussion above confusing, to say the least.
To reiterate what was said above: https://www.nlm.nih.gov/mesh/intro_retrieval.html is an official page of the w:United States National Library of Medicine. Its topic is "Use of MeSH in Online Retrieval". What is used in online retrieval is the "MeSH vocabulary" of strings. If a human has the MeSH descriptor ID (P486) value then that gives a link to a page with the string on, and the human can read the string; but for machines, the MeSH pages are subject to anti-scraping measures.
Conclusion: the so-called "MeSH Descriptor ID" is not what is helpful for web scraping of the MeSH vocabulary. The point of the proposal is to have a copy of the vocabulary here, which will then be easily machine-readable. For example, to search it for the string "neoplasm" would be a simple SPARQL exercise.
The counter-suggestion that a download from the MeSH site's FTP area would provide the same information neglects the issue of "barriers to entry": it requires much more know-how. There is an accepted strategic aim: "Knowledge as service: Wikimedia will be a platform that serves open knowledge to the world, across interfaces and communities". See m:ESEAP Conference 2018/Program/Wikimedia 2030 Movement Strategy.
I have heard nothing here to change my view that this proposal is squarely in line with that aim. "Open knowledge" that requires developer tools is not the same thing. We are talking here about information retrieval from the medical literature, and lowering the barriers to entry. I have nothing against the rationalisation of the existing MeSH properties, but my experience in the work reported on at User:Charles Matthews/NCBI2wikidata leads me to see that control of the MeSH vocabulary has an importance well beyond simple-minded "authority control". Charles Matthews (talk) 11:12, 13 April 2019 (UTC)[reply]
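The "neoplasm" search mentioned above might look something like the following sketch. Since the proposed property does not exist, it assumes the strings are held as subject named as (P1810) qualifiers on P486 statements; the function names are illustrative, and the query has not been tuned for the query service's time limits.

```python
def escape_sparql_string(text: str) -> str:
    """Escape backslashes, double quotes and newlines for a SPARQL literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n"))


def build_search_query(needle: str) -> str:
    """Build SPARQL text finding items whose MeSH string contains `needle`.

    Matching is case-insensitive: the needle is lower-cased here and the
    stored string is lower-cased with LCASE in the query.
    """
    literal = escape_sparql_string(needle.lower())
    return "\n".join([
        "SELECT ?item ?meshId ?descriptor WHERE {",
        "  ?item p:P486 ?statement .",
        "  ?statement ps:P486 ?meshId ;",
        "             pq:P1810 ?descriptor .",
        f'  FILTER(CONTAINS(LCASE(?descriptor), "{literal}"))',
        "}",
    ])


if __name__ == "__main__":
    print(build_search_query("neoplasm"))
```

With a dedicated string property the statement and qualifier lines would collapse into a single triple pattern, which is the simplification the proposal was after.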