Wikidata:Property proposal/cites work string

cites work string

Originally proposed at Wikidata:Property proposal/Creative work

   Not done
Description: When you add the statement "cites work", quite often no Wikidata item can be found for the cited paper or book. The new property enables the inclusion of the title of the cited work as text, with a "series ordinal" and, where available, a "DOI" added as qualifiers.
Data type: String
Example 1? → "Is the loss of Australian digging mammals contributing to a deterioration in ecosystem function?"
qualified with series ordinal (P1545) "91", DOI (P356) "10.1111/mam.12014"
Example 2? → "Emerging frameworks for understanding and mitigating woody plant encroachment in grassy biome"
qualified with series ordinal (P1545) "68", DOI (P356) "10.1016/j.cosust.2018.04.005"
Example 3? → "Do dung fungal spores make a good proxy for past distribution of large herbivores?"
qualified with series ordinal (P1545) "18", DOI (P356) "10.1016/j.quascirev.2012.11.018"

Motivation

This allows readers of an item to gain a more complete understanding of the work. It also enables a bot to pick up available DOIs and make Wikidata more complete. GerardM (talk) 06:15, 26 November 2020 (UTC)

Discussion

  •   Support --Jeb (talk) 07:15, 26 November 2020 (UTC)
  •   Support -- I think this would be great, but I think it needs to be more general. That is, I tried doing this as a crude experiment here: Ridleyandra merohmerea (Gesneriaceae), a new species from Kelantan, Peninsular Malaysia (Q42258926). It's a hack of cites work (P2860), but you can see citations such as "Kiew R (2009) Three new species of Gesneriaceae from Kelantan, Malaysia. Gardens Bulletin Singapore 61: 73–79". At some point, references like this will be in Wikidata. Likewise, because citation linking typically only happens when a work is added to Wikidata (e.g., via CrossRef), we can lose a lot of citation data if few of the references cited are in Wikidata (i.e., even if those references have DOIs, if they aren't in Wikidata they don't get linked). --Rdmpage (talk) 10:34, 26 November 2020 (UTC)
  •   Question If there is a title and some identifier (e.g. DOI), why not create an item for it? --- Jura 10:36, 26 November 2020 (UTC)
    • That can be done, although it assumes that the tools used to add a work will be clever enough to call themselves to add each missing DOI (and then if you add those references, why not add the missing DOIs they cite, etc.). There will also be cases where the cited reference has a DOI but the author and/or publisher are unaware of that DOI (or it has been added since the work was published). --Rdmpage (talk) 10:44, 26 November 2020 (UTC)
    • The three (incorrectly formatted) samples provided for this proposal include DOIs. --- Jura 10:47, 26 November 2020 (UTC)
      • Sure, which is why I've suggested expanding the proposal to take the original citation string (which may or may not include a DOI). I think the issue here is how we crawl the academic graph. If we add everything in date order, then any DOI cited by a new work will already be in Wikidata. But we don't. --Rdmpage (talk) 10:51, 26 November 2020 (UTC)
        • When your source is a PDF (as was mine) you do not have a title that is exactly the same as what is on the DOI. With some software, we can check if the DOI exists. So a tool with a DOI as an input would also work wonders in limiting the amount of effort it takes and it will improve the quality of the work considerably. Thanks, GerardM (talk) 11:03, 26 November 2020 (UTC)
      • Adding 37 new items is too much to ask for just one paper (as mentioned in my blog). Just adding the 54 papers to the one paper took me a week of the time I had available. People have limited availability and in this way it is . Thanks, GerardM (talk) 10:58, 26 November 2020 (UTC)
        • Assuming you do check if we already have items before adding the string version, why should it take more time to create 37 items with a DOI than to add 37 strings with a DOI as qualifier? --- Jura 11:13, 26 November 2020 (UTC)
          • You're comparing adding 37 statements to one item, versus adding 37 items (each with multiple statements). One is faster than the other, surely? --Rdmpage (talk) 11:23, 26 November 2020 (UTC)
            • If the information is the same, why should it be? --- Jura 11:26, 26 November 2020 (UTC)
              • Not sure that I follow. A given CrossRef record for a DOI typically contains a list of cited works that are either represented just by a DOI, or by a string. If you have the DOI, that's enough to locate the item in Wikidata (another API call) and make the cites work (P2860) link, but if the item doesn't exist you'd need another call to CrossRef to get the data you need to add that item. Worst case scenario, a reference with 10 DOIs none of which occur in Wikidata will need 10 calls to Wikidata and 10 calls to the CrossRef API before you can add the original item. --Rdmpage (talk) 14:06, 26 November 2020 (UTC)
                • I don't see why you would check another database after you did check Wikidata that there is no preexisting item. You just run a batch that adds the information to Wikidata in one way or the other. --- Jura 14:14, 26 November 2020 (UTC)
                  • I think we are getting somewhat off topic, but it's not quite as simple as you "run a batch". If you want to build the citation graph as you add the items, then the order in which you add items matters - you can only link to a reference cited if that reference is already in Wikidata. There are lots of references in Wikidata that cite other items but Wikidata doesn't "know" that because the older, cited papers were added more recently than the younger paper that cites them. --Rdmpage (talk) 18:16, 26 November 2020 (UTC)
          • Yes, back to topic: it's sufficient to create a new item for each title (including an identifier) and add these as values to P2860. For the contributor who checks for existing items, this won't require any additional effort when done with a sensible tool. The items can then be completed further. --- Jura 09:10, 27 November 2020 (UTC)
            • So let me get this straight, given the choice between adding string properties to a single item, and creating multiple new items (based on as little information as a title) you prefer the latter? Given the variation in how people cite the same article (and how publishers capture those citations in their metadata), won't this have the result that we fill Wikidata with lots of mostly empty items that are the same thing? And how do those of us adding complete references ensure that we aren't duplicating an item that already exists (if only as a title)? It's already a challenge to avoid duplicate entries when creating new items, and this suggestion seems to make things worse. I think we all want the same thing: every bibliographic reference with a Wikidata item, all citations between them represented by cites work (P2860). The issue is how we get there, and the spirit of the original proposal seems to me to be to try and help get us there more quickly. --Rdmpage (talk) 11:40, 29 November 2020 (UTC)
              • I think my approach matches better the usecases presented as samples above. --- Jura 13:12, 29 November 2020 (UTC)
                • In the workflow there is no room for adding items. A bot may pick up on a DOI and add missing items et al. However, titles are all too often problematic because they are not exactly as on the paper itself. As Roderick indicates, many of the papers do not have a DOI and require TLC for them to be added. In the meantime, with "cites work string" there is a representation of all the citations. Your approach prevents that from happening. Thanks, GerardM (talk) 13:29, 29 November 2020 (UTC)
  •   Support. Just concerning adding DOIs as qualifiers, this is absolutely useful for an easy creation of missing Wikidata items having DOIs and for the substitution of cites work string statements by corresponding cites work statements. This can be handled by the OpenCitations API without any significant human effort. Please see https://w3id.org/oc/index/api/v1/metadata/10.1108/jd-12-2013-0166__10.6084/m9.figshare.3443876 as an example. --Csisc (talk) 11:25, 26 November 2020 (UTC)
  •   Oppose. For most properties, if there is no wikidata item, we set property value = somevalue with qualifier object named as (P1932) = <string>. I think that that is a better way forward. No objection to using identifiers like DOI (P356) for statements with value somevalue. Jheald (talk) 12:57, 26 November 2020 (UTC)
  •   Oppose like Jheald – I think unknown value makes more sense here. You’re proposing to qualify the statement with a DOI anyways, so we can just as well add the title (P1476) as a qualifier too (and page(s) (P304) etc.). --Lucas Werkmeister (talk) 13:06, 26 November 2020 (UTC)
  •   Support I like the proposal in general. It reminds me of Crossref's unstructured_citation property for citations "for which no structured data is available". As I see it, it may be the literal citation string as it appears in the references section of a journal article. As discussed on WikiCite's Telegram group, I see it as analogous to author name string (P2093), used for cases where "Wikidata item for author (P50) does not exist or is not known". I like the idea of having DOI as qualifier to make it easier to create or map to a Wikidata item later. In fact, I wonder why the author name string property doesn't have an ORCID or similar qualifier, analogously. Finally, although I like Rdmpage's workaround, I acknowledge his concern about QuickStatements not liking to add multiple instances of unknown value to the same property. --Diegodlh (talk)
    • The information CrossRef has for a citation ranges from a DOI, to an unstructured citation string, to structured data such as an ISSN, a volume, and a start page (suitable for forming an OpenURL query), so ideally we could incorporate all these cases, which would make life easier for any bot coming along to try and resolve these citations to Wikidata items. --Rdmpage (talk) 20:12, 26 November 2020 (UTC)
  •   Oppose Agree with Jheald and Lucas Werkmeister: some value + qualifier (as demonstrated in Q42258926#P2860) looks great to me. That QuickStatements does not play well with this construct sounds like a feature request for QuickStatements. Jean-Fred (talk) 09:46, 27 November 2020 (UTC)
  •   Oppose adding dumps of large chunks of unformatted text to items: either with this proposed property or unformatted with qualifiers to "somevalue". Wikidata is not a free text database. --- Jura 09:55, 27 November 2020 (UTC)
    • I wouldn't characterise it as "large chunks of unformatted text"; often the information is structured, it's just that we don't have the corresponding Wikidata item yet. For example, here are two references in CrossRef metadata:
      {
        "key": "20344_B4",
        "DOI": "10.3897/phytokeys.25.5178",
        "doi-asserted-by": "publisher"
      },
      {
        "key": "20344_B5",
        "first-page": "125",
        "article-title": "Two new species and one new subspecies of Ridleyandra (Gesneriaceae) from Peninsular Malaysia.",
        "volume": "66",
        "author": "Kiew",
        "year": "2014",
        "journal-title": "Gardens’ Bulletin Singapore"
      }
    • One has a DOI, which makes it easy to discover if a Wikidata item already exists (which is likely the case if it has a DOI). The other is more of a challenge, but there is enough information to help a bot make a link when and if that reference gets added. As it stands we lose information every time we add a reference that has citations that lack DOIs. Note also that unless we do something like this we have no way of knowing how complete the citation links are. If an item has one cites work (P2860), does that mean it only cites one paper, or does it mean it cites 10 papers, only one of which has a DOI? At the moment we can't tell. --Rdmpage (talk) 16:19, 27 November 2020 (UTC)
      • It's unformatted because it's added as a single concatenated string value instead of being mapped to properties we usually expect. The removal of formatting leads to a loss of information and prevents other users from building on the information. The latter is the main objective of this project. --- Jura 22:50, 27 November 2020 (UTC)
  •   Comment In general I agree with the suggestions to use "somevalue" with qualifiers; however the QuickStatements problem is one that hopefully can be addressed somehow. Though QuickStatements is probably the wrong tool to use to add such things to a Wikidata item anyway, since it will make 1 (or more) edit per statement, as opposed to a single edit a bot could make adding all the citation data. More importantly, the proposal doesn't seem to be defining what string is intended to be put in as value. From the examples it looks like only the title of the cited work? I would prefer, if we do something like this (or use "somevalue" with a "stated as" qualifier), that the entire citation string, as present in the original work, be placed. Also I think the property, if created, should be labeled "citation string", not "cites work string", as a single citation may refer to multiple works, or none at all directly. ArthurPSmith (talk) 22:19, 27 November 2020 (UTC)
  •   Oppose per James and Lucas. Mahir256 (talk) 05:57, 29 November 2020 (UTC)
  •   Comment It seems this discussion has reached an impasse. On the one hand we have people who would like to (a) add unstructured citation data they consider to be useful (b) using tools (manual editing, QuickStatements) that they are familiar with. This means we can make progress now, at the cost of adding unstructured data that we hope to convert to items in the future (as we currently do with author names). On the other hand we have people (a) concerned by the addition of unstructured data to a structured data project, and who (b) propose solutions (e.g., bots, improvements to QuickStatements) that are either outside the competence of the people interested in the proposed property, or present a road block to making progress now. I suspect this is not the first time this sort of split has happened: one side want to do it now with existing tools with a promise to clean it up later, the other side want it clean to start with, using tools that don't yet exist. Does anyone see a way forward? --Rdmpage (talk) 12:54, 29 November 2020 (UTC)
    • The solution I proposed should be doable with QS. Occasionally, we try approaches with free text (e.g. Property:P7535), but that didn't quite work out (at least that's my view about values like [1]). P7535 has the advantage that we just get one such statement per item. Maybe Commons could take them as data pages? --- Jura 13:10, 29 November 2020 (UTC)
      • I'm not clear what your QS solution actually is, nor how it helps someone like GerardM add citation strings by hand (which I think was the original motivation for this proposal). And how does offloading this to Commons help? --Rdmpage (talk) 13:49, 29 November 2020 (UTC)
      • In the paper I worked on there are some 90 citations. Two thirds of them have an item in Wikidata; those with and those without an item are mixed. When a string is replaced with an item, the string is to be removed. When we are lucky there is a DOI, often there is not. Each string in a paper has identifiable parts; typically they comply with standards. These parts can be used as qualifiers to aid in identifying papers. We should not waste time of people and use a tool like QuickStatements. The use of Commons is inappropriate because it makes it confusing and does not aid processes nor workflows. I sincerely do not understand the points made as they do not consider workflows nor processes. Thanks, GerardM (talk) 15:28, 29 November 2020 (UTC)
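
The "unknown value plus qualifiers" approach and the range of CrossRef reference shapes mentioned in this discussion can be sketched together. This is a minimal Python sketch, not an existing tool: it assumes the Wikibase JSON statement layout, uses the qualifiers named in the thread (series ordinal (P1545), DOI (P356), object named as (P1932)), and assumes an "unstructured" key for free-text references. The function names and the way a string is assembled from structured fields are hypothetical choices made for illustration.

```python
def snak(prop, value=None):
    """Build a Wikibase snak; with no value it is an "unknown value" snak."""
    if value is None:
        return {"snaktype": "somevalue", "property": prop}
    return {
        "snaktype": "value",
        "property": prop,
        "datavalue": {"value": value, "type": "string"},
    }


def citation_string(ref):
    """Assemble a readable string from whatever fields the reference has."""
    if "unstructured" in ref:  # assumed key for a free-text citation
        return ref["unstructured"]
    parts = [
        ref.get("author"),
        f"({ref['year']})" if "year" in ref else None,
        ref.get("article-title"),
        ref.get("journal-title"),
        ref.get("volume"),
        ref.get("first-page"),
    ]
    return " ".join(p for p in parts if p)


def placeholder_statement(ref, ordinal):
    """Turn one reference into a cites work (P2860) "unknown value" statement."""
    qualifiers = {"P1545": [snak("P1545", str(ordinal))]}
    if "DOI" in ref:
        qualifiers["P356"] = [snak("P356", ref["DOI"])]
    text = citation_string(ref)
    if text:
        qualifiers["P1932"] = [snak("P1932", text)]
    return {
        "type": "statement",
        "rank": "normal",
        "mainsnak": snak("P2860"),
        "qualifiers": qualifiers,
    }


# The two references quoted in the discussion above.
references = [
    {"key": "20344_B4", "DOI": "10.3897/phytokeys.25.5178",
     "doi-asserted-by": "publisher"},
    {"key": "20344_B5", "first-page": "125",
     "article-title": "Two new species and one new subspecies of Ridleyandra "
                      "(Gesneriaceae) from Peninsular Malaysia.",
     "volume": "66", "author": "Kiew", "year": "2014",
     "journal-title": "Gardens’ Bulletin Singapore"},
]
statements = [placeholder_statement(r, i) for i, r in enumerate(references, 1)]
```

Because all statements for one paper are built before anything is written, a bot could submit them in a single edit, which is the advantage over one edit per statement noted above.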

I do not get it

What I propose is part of a workflow: the workflow for "scholarly papers", probably the biggest subset within Wikidata. Part and parcel of that workflow is the property "Author name string". It performs the exact same task as the property proposed, but for authors. It completes the data for a scholarly article where Wikidata is not complete. There are two tacks in this workflow: automated and manual additions. Bots add citations for a given paper, typically for cited papers we already have in Wikidata. This easily enables a similar process, one that looks up missing papers from Crossref and ORCiD and replaces the "cites work string" with the "cites work" property. This process will have a big impact on what Scholia displays for authors, papers and subjects.

The other workflow is where individuals like myself find missing items in Wikidata and add "cites work string". This is a lot of work, so it will likely be done for individual papers only. When enough identifying information is given, it is likely that the bot process described above will pick up on these statements. The notion that we will gain more incomplete data is valid; however, at present we hide how incomplete Wikidata is. For me the worst level of quality is level zero, the data we do not have but should have given the data that we hold.

Some consider all kinds of theoretical notions, the fact that a bot COULD pick up on data, the fact that an item COULD be created. They burden others with the consequences. Adding all "cites work" statements is a lot of work, and I do not volunteer for more work. There is no bot that DOES pick up on this kind of data. So the net effect of opposing this proposal is that we will not get better data and remain mediocre. I am totally in favour of automating the process whereby a DOI can be given as a parameter to a process that picks up on all the data enabling the inclusion with "cites work". That expects that all manual data is correct. We can do better and that is why this proposal. Thanks, GerardM (talk) 07:55, 29 November 2020 (UTC)
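
The replacement step described in this section, where a process swaps a string placeholder for a proper "cites work" value once the cited paper has an item, could look roughly like the following. This is a sketch under assumptions, not an existing bot: citations are held as plain dictionaries, `doi_to_qid` is a hypothetical lookup table standing in for whatever query a real process would run, and the QID used in the example is invented.

```python
def resolve_placeholders(citations, doi_to_qid):
    """Replace string placeholders by item links where the DOI is known.

    Each citation is a dict with an "ordinal" and either a "qid" (already
    linked) or a "text" placeholder, optionally accompanied by a "doi".
    """
    resolved = []
    for cit in citations:
        doi = (cit.get("doi") or "").lower()
        qid = cit.get("qid") or doi_to_qid.get(doi)
        if qid:
            # The string is dropped once an item takes its place.
            resolved.append({"ordinal": cit["ordinal"], "qid": qid})
        else:
            resolved.append(dict(cit))
    # Keep the reference list in "series ordinal" order.
    return sorted(resolved, key=lambda c: c["ordinal"])


def completeness(citations):
    """Return (linked, total) so the missing links are visible, not hidden."""
    linked = sum(1 for c in citations if "qid" in c)
    return linked, len(citations)


citations = [
    {"ordinal": 91, "doi": "10.1111/mam.12014",
     "text": "Is the loss of Australian digging mammals contributing to a "
             "deterioration in ecosystem function?"},
    {"ordinal": 18, "doi": "10.1016/j.quascirev.2012.11.018",
     "text": "Do dung fungal spores make a good proxy for past distribution "
             "of large herbivores?"},
]
# Hypothetical lookup result; the QID is made up for illustration.
known_items = {"10.1111/mam.12014": "Q999999991"}
updated = resolve_placeholders(citations, known_items)
linked, total = completeness(updated)
```

Keeping the unresolved entries in the list is what makes the `completeness` count meaningful: an item with one linked citation out of two is distinguishable from an item that cites only one paper.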

Update

I have worked on one paper and find issue with the proposal. First I need a true placeholder; the one suggested in combination with "Mix'n'match" does not work for me. It is replaced by a text and consequently I have to remove my statement instead of replacing it with what it reserves the place for. It follows that the statements are no longer in order of "series ordinal".

I am happy with the fact that a subset of "SourceMD" is still available. The problem with that suggestion is that it only works with Crossref DOI.

For me this is a recurring workflow; I do it manually and I prefer for a bot to do most of the work. When I am done adding a placeholder with a DOI, additional work needs to be done. At least we should look for the paper at ORCiD and replace "author name string"s with "author" statements.

  • I am looking for a placeholder that is not replaced by a string
  • I want to trigger processes that enrich data for an author, for a paper, for a subject.

Thanks, GerardM (talk) 08:44, 3 December 2020 (UTC)

  • It seems your approach isn't compatible with the way Wikidata was defined. Commons might be able to accommodate randomly formatted data pages. --- Jura 09:09, 3 December 2020 (UTC)
    • Really? How is the data randomly formatted, it is anything BUT random. Check the work that I have done. What is left are publications without a DOI. Thanks, GerardM (talk) 11:22, 3 December 2020 (UTC)
      • I think "author name string" isn't used correctly and you seem to add qualifiers that aren't expected at the place you add them. Anyways, what prevents you from using the "create a new item" button or a corresponding cradle template? --- Jura 11:30, 3 December 2020 (UTC)
        • Would it be fair to say that we could resolve this discussion if we had two tools? The first would automate the adding of (currently) unresolved citations as placeholders in the list of cites work (P2860) so that information on citations to works that don't currently exist in Wikidata is not lost. This is @GerardM's use case. The second tool would go through these placeholders and attempt to replace them by items, either by generating those items if they are missing, or mapping the citation onto an existing item, thus converting the text strings to items and reducing the number of placeholders, which is @Jura1's concern. Hence over time the number of cites work (P2860) links would grow, which is presumably what we all want. --Rdmpage (talk) 08:52, 4 December 2020 (UTC)
          • Supposedly that could help, but we have several tools that should already help with the usecase presented in the proposal. Did either of you already try Cradle? --- Jura 10:25, 4 December 2020 (UTC)
            • Why suggest a tool that starts from the assumption that all the information is available? A tool that is not as functional as SourceMD? Why not consider that I need a method that identifies the incomplete basics like a DOI or an ISBN, goes away to other resources, gets the information I do not have and creates an item, and then replaces the PLACEHOLDER, rather than providing me with a MickyMouse contraption that does not function as a placeholder because of the assumption that it is only a string and not something that is to be replaced by a newly minted QID? Thanks, GerardM (talk) 10:43, 4 December 2020 (UTC)
              • I think there is a degree of mutual incomprehension in this discussion. @Jura1: I hadn't seen Wikidata:Cradle, the tool itself seems down, and it doesn't really address the problem. I think the fundamental issue here is "when do we create cites work (P2860) links?" If we have to do this every time we add a publication, then for each paper with n citations we have to potentially add n more items to Wikidata (if none of the papers cited already have an item). I suspect this is what @GerardM is frustrated by. Furthermore, if we take this to its logical conclusion, then for each cited item we add, we would need to add items for all the papers that the cited item itself cites, and our task quickly explodes. An alternative is to add cites work (P2860) to connect citations that already exist in Wikidata, then add the remaining ones as semi-structured information for a tool to come along and attempt to update those to items. If we don't add the semi-structured information then it is effectively lost unless someone else (or a bot) comes along, reharvests the source data, and repeats the process. If we store the semi-structured data we avoid reharvesting from source, and also provide information for bots to convert the semi-structured citations into items. In any event, I think we can achieve all this with the appropriate tools (which don't seem to exist at the moment), and I hope to experiment with some solutions in the near future. --Rdmpage (talk) 11:41, 4 December 2020 (UTC)
                • "if we take this to its logical conclusion, then for each cited item we add, we would need to add items for all the papers that the cited item itself cites, and our task quickly explodes.": if that was true, we wouldn't create any item for anything at Wikidata. --- Jura 11:52, 4 December 2020 (UTC)
                  • @Jura1: I fear that you misunderstand my point. I am talking about the specific case of how much work should we do when adding one bibliographic item to Wikidata, and how can we do that most efficiently and at the same time flesh out the citation graph without having to effectively crawl a large chunk of the citation graph each time we add a single publication. My apologies if I've not been able to explain this clearly. --Rdmpage (talk) 12:12, 4 December 2020 (UTC)
  •   Oppose (in addition to the above) not a suitable approach to manage completeness of items (or create only complete items). --- Jura 12:32, 4 December 2020 (UTC)

Thank you so very much. I now know how to do this manually. Is there a way to query for this special value? Thanks, GerardM (talk) 19:46, 8 December 2020 (UTC)
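
On querying for the special value, a hedged sketch: it assumes the Wikidata Query Service, where `wikibase:isSomeValue()` tests whether a statement value is an "unknown value". The helper name `unknown_citations_query` is hypothetical, and the Python only builds the SPARQL text, which would then be pasted into or sent to the query service.

```python
def unknown_citations_query(item_qid, limit=100):
    """Return SPARQL listing the "unknown value" cites work statements of an item."""
    return f"""
SELECT ?statement ?ordinal ?doi ?namedAs WHERE {{
  wd:{item_qid} p:P2860 ?statement .
  ?statement ps:P2860 ?value .
  FILTER(wikibase:isSomeValue(?value))
  OPTIONAL {{ ?statement pq:P1545 ?ordinal }}
  OPTIONAL {{ ?statement pq:P356 ?doi }}
  OPTIONAL {{ ?statement pq:P1932 ?namedAs }}
}}
LIMIT {int(limit)}
""".strip()


# The experimental item mentioned earlier in the discussion.
query = unknown_citations_query("Q42258926")
```

The optional qualifier patterns mean a placeholder is still listed when it carries only some of the qualifiers, for example a citation string but no DOI.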