Wikidata:Property proposal/research status

research subject recruitment status

Originally proposed at Wikidata:Property proposal/Natural science

Motivation

I am proposing this property to complement the 300,000 clinical trials in Wikidata from ClinicalTrials.gov (Q5133746) which have ClinicalTrials.gov ID (P3098).

The easiest way to tag research projects is when they are complete, because then we do not have to worry about updating them. If there are 300k clinical trials then I hope that 250k are done, so if we tag those then we have completed stable data. Those completed trials are interesting to demonstrate what experience institutions and individuals have in doing certain kinds of research, and they also tell stories of sponsorship and investment by region or social network.

I expect that most users will be more interested in contemporary research. A common request which I have heard for the past 20 years is how any patient can find relevant clinical trials anywhere, then refine the results to see clinical trials actively recruiting as close as possible to their geographic location. We still have development to do, like getting a list of research sites and disambiguating their names. For a given hospital or medical school there might be 100 research sites, and maybe 20% of those will account for 80% of the research. Wikidata has not started collecting these but I think we will, because these sites interconnect people, research projects, funding, sponsors, research publications, and an entire regional influence around certain sectors of research.

Another interesting insight here is surfacing the anomalies. Many papers discuss this, for example Discrepancies between ClinicalTrials.gov recruitment status and actual trial status: a cross-sectional analysis (Q42650121). The general situation is that the United States federal government mandates that medical researchers and pharma companies register their research in ClinicalTrials.gov, but often the companies are careless about accuracy or about keeping their entries up to date. I expect that Wikidata is capable of exposing which companies are worth more than US$1 billion yet fail to support secretaries to keep their filings up to date. We can look into that as well; significant errors might exist in 20% of clinical trials. Interesting errors could be trials which should have ended but where the company never closed out recruitment, or implausible start and stop dates related to enrollment.
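The two kinds of error mentioned above (recruitment never closed out, and implausible start and stop dates) could be flagged with a simple check. This is only a minimal sketch: the `Trial` record, its field names, and the status label "Recruiting" are assumptions made for illustration and are not the actual ClinicalTrials.gov or AACT schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Trial:
    # Hypothetical record layout for illustration; not the AACT schema.
    trial_id: str
    status: str
    start_date: Optional[date]
    completion_date: Optional[date]


def find_anomalies(trial: Trial, today: date) -> List[str]:
    """Return human-readable flags for implausible status/date combinations."""
    flags: List[str] = []
    # A trial cannot plausibly finish before it starts.
    if trial.start_date and trial.completion_date:
        if trial.completion_date < trial.start_date:
            flags.append("completion date precedes start date")
    # A trial listed as recruiting after its completion date was probably never closed out.
    if trial.status == "Recruiting" and trial.completion_date:
        if trial.completion_date < today:
            flags.append("still recruiting after completion date")
    return flags


# Made-up example record, not a real trial.
stale = Trial("EXAMPLE-1", "Recruiting", date(2015, 1, 1), date(2017, 1, 1))
print(find_anomalies(stale, today=date(2019, 11, 27)))
```

A real check would need the registry's own date fields and status vocabulary, and probably a grace period before calling an entry stale.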

Matters of general interest could be identifying what sorts of clinical trials are most likely to be withdrawn, which ones have the longest enrollment terms, or clustering patterns of behavior in recruitment for various sorts of trials, or by certain companies, or in different regions.

I do not have a plan in place to keep trial recruitment status up to date. AACT Database (Q76654384) of the Clinical Trials Transformation Initiative (Q76654258) updates daily, converting the problematic data structure of ClinicalTrials.gov into data which is much easier for Wikidata to ingest. Wikidata's clinical trials data has come from this thanks to Duke University and user:Tibbs001, and this user also maintains that database. I want to try to profile active or more research from Duke University since they made this data available along with my school University of Virginia, and Vanderbilt University in the region. I will explore doing some profiling of clinical trials at some universities in India also to experiment with language translation in medicine.

Although I think that ClinicalTrials.gov defines this property best, this property and these statuses could be used for any sort of research on humans. I am imagining psychology research, surveys, or user experience research. If someone wanted to adapt this property for other research in the future then I see that as an option for exploration. ClinicalTrials.gov calls this field "recruitment status"; to keep this a bit more open for the possibility of reuse, I called this property "research status". If there is a problem we could name it "recruitment status" directly after ClinicalTrials.gov.

Blue Rasberry (talk) 20:18, 27 November 2019 (UTC)

  Notified participants of WikiProject Medicine Blue Rasberry (talk) 21:08, 15 December 2019 (UTC)

Discussion

  Done. I also changed the label of the proposed property from the rather ambiguous "research status" to "research subject recruitment status", since this is closer to the way the proposal was framed. Like User:Bluerasberry above, I could also imagine such a property being useful in other contexts in principle (think "What's the recruitment status for non-medical research projects?" or an even broader property "recruitment status" for things like "What's the recruitment status for role X in company Y?"), but as far as I can see, there is very little reliable information about such other potential uses, so we might as well stick with clinical trials and thus "clinical trials research subject recruitment status", for which the information is at least somewhat reliable. --Daniel Mietchen (talk) 01:54, 30 November 2019 (UTC)
Thanks, "research subject recruitment status" seems cool. Alternatives could be "research participant recruitment status", "participant recruitment status", or "subject recruitment status". Blue Rasberry (talk) 19:03, 10 December 2019 (UTC)
@Pintoch:
About temporal qualifiers -
I prefer to avoid creating new temporal qualifiers in Wikidata, and instead to import only the official ones from ClinicalTrials.gov (Q5133746) as the only authoritative dataset. Even in that one dataset, there are more temporal qualifiers than I think Wikidata needs right now. I am not seeking to replicate that dataset, but only to develop the parts of it which interlink well with current Wikidata applications. Too much time data is out of scope.
If we imported temporal qualifiers there are two sorts which I consider right now. One is "status as of a certain date, probably time of upload". So if a study has a certain status now, we could say that Wikidata copied that status as of that time. I prefer not to include this because I hope that if anyone imports data without a temporal qualifier, then it is up to date at time of upload, and anyone could find that in the Wikidata history log. The other use case that I am imagining is temporal qualifiers to note milestones in research study status change. These are more interesting in the long term, and also these are reported fields in the original dataset so we would not need to create new data for Wikidata as in the other case. However, a typical study might have 5 of these temporal status changes, and right now we do not have a use case for this. If anyone wants that data it is available for import, but right now, for completed or terminated trials, I want to tag them as finished because that ought not to change.
A very interesting use case for active trials, which would require temporal modifiers, is for anyone to query for active trials on a medical condition, with an intervention, and by region. People, including patients and researchers, spend lots of money in various commercial services trying to identify clinical trials active now for a certain condition. I would like to pilot a few organizations for the clinical trials which they have active now, and in doing so, develop best practices about temporal qualifiers for status changes.
About "unknown" -
I am open to hearing about a common term or process for expressing the concept "unknown", but my first thought is that this should be its own item. With regard to clinical trials, "status=unknown" carries a connotation with many social implications. If a trial has status unknown, then that means that a researcher had reporting obligations, but then fell out of communication, and also the situation was strange enough that an audit identified the lack of communication. Entangled in this is probably dubious management of a large sum of money, US$2 million or more perhaps, and community volunteers have donated their body fluids to science only to have them left in some kind of limbo, which is problematic. The part of "unknown" which is special for clinical trials is that it communicates a designation after some third-party review, and is neither a lack of reporting nor a self-reported status like the others. Blue Rasberry (talk) 19:02, 10 December 2019 (UTC)
@Bluerasberry: just to double-check, since you do not mention them in your reply: are you aware of Help:Statements#Unknown_or_no_values? − Pintoch (talk) 19:59, 29 January 2020 (UTC)
@Pintoch: I am not familiar with that, and I want guidance, but I think it does not apply.
The "unknown value" in this case is not because of lack of information, but because of a structured data designation of "unknown" which has a particular definition and follows a set process of ClinicalTrials.gov. There could be two kinds of unknowns in play here: "Unknown", as in lack of data, and "Unknown", as in officially designated by the authority to be missing. What is your opinion? I would go with whatever seems appropriate on anyone's advice. Blue Rasberry (talk) 20:08, 29 January 2020 (UTC)
Intuitively, the fact that unknown status is asserted by the authority should be conveyed by the reference you put on the statement, so I would say that Wikidata's built-in unknown value feature should be appropriate. − Pintoch (talk) 07:38, 30 January 2020 (UTC)
@Pintoch: like this - special:diff/1107355927? Seems reasonable, let's do it. Blue Rasberry (talk) 14:54, 31 January 2020 (UTC)
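For readers unfamiliar with the approach agreed on above, the following is a minimal sketch of what a statement with the built-in "unknown value" plus a reference to the authority might look like. It follows the general shape of the Wikibase JSON claim format, where the snak type `somevalue` stands for "unknown value" and P248 is "stated in". The property ID `P_PROPOSED` is a hypothetical placeholder, since the proposed property had no ID in this discussion, and the helper name is made up for illustration.

```python
import json
from typing import Any, Dict

# Hypothetical placeholder: no property ID is given in this discussion.
RECRUITMENT_STATUS_PROPERTY = "P_PROPOSED"


def unknown_status_claim(stated_in_item: str) -> Dict[str, Any]:
    """Build a claim whose main value is "unknown value", referenced to an authority."""
    numeric_id = int(stated_in_item.lstrip("Q"))
    reference_snak = {
        "snaktype": "value",
        "property": "P248",  # stated in
        "datavalue": {
            "type": "wikibase-entityid",
            "value": {
                "entity-type": "item",
                "numeric-id": numeric_id,
                "id": stated_in_item,
            },
        },
    }
    return {
        "type": "statement",
        "rank": "normal",
        # "somevalue" is the snak type for "unknown value"; it carries no datavalue.
        "mainsnak": {"snaktype": "somevalue", "property": RECRUITMENT_STATUS_PROPERTY},
        # The reference records which authority asserted that the status is unknown.
        "references": [{"snaks": {"P248": [reference_snak]}}],
    }


claim = unknown_status_claim("Q5133746")  # ClinicalTrials.gov
print(json.dumps(claim, indent=2))
```

The point of the design is that the main value stays generic, and the reference alone says the designation comes from ClinicalTrials.gov.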
@Pintoch: Now what about all the others in "allowed values"? They are repeats just like "unknown", right? I made an item for "completed" which refers to "completed according to the authority which is ClinicalTrials.gov". As we import this dataset, I am not aware of publicly available internal datasets, but with these central registries, there could be "complete according to the university doing the research" and "complete according to the central registry to which they later report after stopping locally". I suppose we never need these specific designations, right, and we always use the general Wikidata concepts? Blue Rasberry (talk) 14:57, 31 January 2020 (UTC)