Wikidata talk:WikiProject Books
On this page, old discussions are archived. See: 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023.
Next steps for the default label
See #Labels_for_edition for the previous related message.
@Sic19, Epìdosis, Akbarali, Hsarrazin, Salgo60, Jarekt: @Fnielsen, Mfchris84, Jane023, MartinPoulter, Jahl de Vautban, EncycloPetey:
The default label is still not activated, but I think we should prepare. My suggestion is to add "mul" labels, in Latin script, on editions only, and not to remove any existing labels yet.
For some context: right now we have 540133 items with instance of (P31)version, edition or translation (Q3331189) (https://w.wiki/B2yo), of which 398186 have at least one title (https://w.wiki/B2$2); among them, 6793 have more than one title (https://w.wiki/B2zE ; a bit of everything: multilingual editions, original plus transcription, simple mistakes/errors, etc.).
I suggest copying the title as the "mul" label for these items:
SELECT ?q (SAMPLE(?title) AS ?sampleTitle) (COUNT(?title) AS ?count) WHERE {
  ?q wdt:P31 wd:Q3331189 ;  #edition
     wdt:P1476 ?title ;     #with a title
     rdfs:label ?title .    #with a label strictly identical to the title
  FILTER ( REGEX(?title, "^[A-Z]") ) #the title starts with an uppercase Latin letter
}
GROUP BY ?q
HAVING ( ?count = 1 ) #with only one such title
Currently, this query gives 227182 results, a bit less than half of all the editions we have. It's maybe a bit too restrictive, but I prefer to be cautious (and we can still fix errors before using this query), at least for the first batch of imports. Do you see anything that needs changing or improving in the query? Also, any preference on how to add the labels?
At a later date (at least once the "mul" system is activated for everyone), we could remove the duplicate labels to leave only the "mul" label.
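That later cleanup could also be driven by a query. Below is an untested sketch (the variable names are mine) listing editions where a language-specific label is strictly identical to the title, i.e. labels that would become redundant once a "mul" label carries the same string; note that STR() comparison deliberately ignores the language tag:

```sparql
SELECT ?q ?label WHERE {
  ?q wdt:P31 wd:Q3331189 ;  #edition
     wdt:P1476 ?title ;     #with a title
     rdfs:label ?label .    #and a label
  FILTER ( STR(?label) = STR(?title) )  #label string identical to the title
  FILTER ( LANG(?label) != "mul" )      #but keep the "mul" label itself
}
LIMIT 1000
```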
What do you think?
Cheers, VIGNERON (talk) 15:07, 28 August 2024 (UTC)
- IMHO the query is OK and I see no issue in adding "mul" labels to its results on the basis of your reasoning. Epìdosis 15:15, 28 August 2024 (UTC)
- Yes this is probably useful for many paintings and other works too. – The preceding unsigned comment was added by Jane023 (talk • contribs).
- @Jane023: I had not thought about paintings, but they could be a good class of items for "mul"; the dedicated WikiProject should also think about it. Cheers, VIGNERON (talk) 17:31, 2 September 2024 (UTC)
- The problem with titles of older paintings is: which title is the best? The one in use by the museum (which could be in a language not using Latin script), or the one used by the most highly regarded art historian? Maybe just start with use cases for paintings that have Latin-script titles, and then analyze what is left over for a better approach. Jane023 (talk) 07:20, 3 September 2024 (UTC)
- @Jane023: true, then paintings are not a good class for "mul" (which is not surprising, as they are closer to works than to editions).
- Anyway, should we think about how to move forward for editions? Maybe we could start with a small batch as a test, like 100 items linked to different Wikisources, to check that there is no problem with templates re-using Wikidata?
- Cheers, VIGNERON (talk) 14:15, 9 September 2024 (UTC)
- Sounds good to me! Jane023 (talk) 14:26, 9 September 2024 (UTC)
What should be the property to link an edition to the editorial collection it's part of?
Within inventaire.io (Q32193244), we have been using collection (P195) to link instances of version, edition or translation (Q3331189) to instances of editorial collection (Q20655472), and as editions can now be transferred from Inventaire to Wikidata, those statements are starting to appear here too (see example). But it has been suggested that we should instead use part of the series (P179); any opinions? I would think that if we had work series and edition collections using the same property, it would be even harder to split Wikidata items that are both a work and an edition. Maybe we should create a dedicated property that could then have P1629=Q20655472? -- Maxlath (talk) 12:23, 29 October 2024 (UTC)
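As background for this discussion, a rough, untested sketch of a query counting how editions are currently linked with each candidate property could look like this:

```sparql
SELECT ?property (COUNT(DISTINCT ?edition) AS ?editions) WHERE {
  VALUES ?property { wdt:P195 wdt:P179 }  #collection vs. part of the series
  ?edition wdt:P31 wd:Q3331189 ;          #edition
           ?property ?group .
}
GROUP BY ?property
```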
Tool to upload to OpenLibrary and Wikidata?
Hi! I recently discovered OpenLibrary and I am starting to add books there (to avoid using Goodreads). However, it would be nice to also make those contributions to Wikidata. Is there a tool which makes it easier to upload to both? Or do I have to do it twice? Dajasj (talk) 13:08, 13 January 2025 (UTC)
- Hello! This doesn’t answer your question, but please have a look at inventaire.io, BookWyrm and BookBrainz. inventaire.io uses Wikidata data, can export to Wikidata (authors, works, publishers, series …) and automatically links some data to OpenLibrary. BookWyrm is a federated FOSS alternative to GoodReads and uses data from both OpenLibrary and inventaire.io. --Reclus (talk) 13:28, 13 January 2025 (UTC)
- Thanks! I'll have a look at it :) Dajasj (talk) 13:34, 13 January 2025 (UTC)
Notability of encyclopedia articles
WikiProject Books has more than 50 participants and couldn't be pinged; please post on the WikiProject's talk page instead. Hi all! As of now, we have 636659 items which have instance of (P31)encyclopedia article (Q13433827) (cf. Special:Search/haswbstatement:P31=Q13433827); sometimes I have doubts about the notability of some of them, so I would like to define it clearly. My proposal is:
an item being instance of (P31)encyclopedia article (Q13433827) is notable if it meets at least one of the three criteria below:
- at least one sitelink (usually to Wikisource), per WD:N 1
- at least one identifier (usually DOI (P356) or Handle ID (P1184)), per WD:N 2
- at least one use in references, per WD:N 3
Incoming links through described by source (P1343) statements are not sufficient to establish notability, in the absence of at least one of the three aforementioned criteria.
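Criteria 1 and 2 are straightforward to query; criterion 3 (use in references) would need a separate check. An untested sketch, relying on the wikibase:sitelinks and wikibase:identifiers page properties exposed by the query service, for articles failing both of the first two criteria:

```sparql
SELECT ?article WHERE {
  ?article wdt:P31 wd:Q13433827 ;   #encyclopedia article
           wikibase:sitelinks 0 ;   #no sitelink (criterion 1 fails)
           wikibase:identifiers 0 . #no external identifier (criterion 2 fails)
}
LIMIT 1000
```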
In absence of objections, I will start deletions next week on the basis of these. --Epìdosis 11:11, 28 January 2025 (UTC)
- Wouldn't it be more desirable to remove point 3 as well? And instead use the encyclopedia (with page number) as the reference, rather than the encyclopedia article? As long as we still use encyclopedia articles as references, I don't think it's strange to specifically refer to encyclopedia articles with described by source (P1343). Dajasj (talk) 14:41, 29 January 2025 (UTC)
- I don’t understand why you think it’s necessary to tidy up that area. While WD:N 2 would presumably grant notability to practically all of these encyclopedic articles, your “at least one identifier” approach is, in my opinion, an extreme tightening that would throw out many useful works which are modeled at article level, thereby destroying a lot of valuable content.
- I recently did that for the articles of Biographical Lexicon for East Frisia (online version) (Q130487123) (with subclass biographical article (Q19389637)), which is a much-cited standard reference work for biographies in the region and is used hundreds of times as the only reliable source by Integrated Authority File (Q36578) and Wikipedia. But since there are no external IDs for every article, all those articles are now supposedly deletable? That doesn’t make sense to me.
- I do understand that you chose these criteria because they’re easy to query and don’t require individual case checks, but I consider those individual checks important, and I don’t see why the deletion of potentially useful and widely used entries, which are clearly notable according to WD:N 2, should be necessary. --Printstream (talk) 18:17, 2 February 2025 (UTC)
- What's wrong with described by source (P1343)? "Person X is described by an article in biographical dictionary Y, written by author Z and citing works A, B, C as authorities"--or more generally, Topic X and Encyclopedia article Y--is potentially very useful information, which would seem to be excluded from Wikidata under your proposal. Yet it would seem to fully belong here under the usual "describes a clearly identified real-world entity" and "fulfills a structural need" criteria. Hupaleju (talk) 00:46, 21 February 2025 (UTC)
- @Epìdosis: I have been active at Wikisource. If I plan to transcribe an old paper, but then pull back due to the amount of work necessary I still would like to keep the Wikidata object since I may change my decision later. What about excluding papers published before 1980 from deletion? That age was still 100 % on paper and shall not harbor neither search engine optimization (Q180711) nor predatory publishing (Q29959533).--Antifaschistische Frontschule (talk) 15:24, 4 March 2025 (UTC)
- I think it could make sense. But I'm not sure I understand the case you describe: you plan to transcribe an encyclopedia, so you create the Wikidata items first, but then you stop adding the sitelinks to Wikisource in these items, and so they fall outside criterion 1 of notability? Epìdosis 15:29, 4 March 2025 (UTC)
Properties for types of scientific work and articles
So, I deal with scientific works more than artistic ones, both in article and book form (and sometimes multi-volume form), but these should mostly use the same properties as creative works, correct? I'd like some advice on how to characterize things. In particular, should the characterization of the scientific form of a work go in instance of (P31), genre (P136), or form of creative work (P7937)?
- I'm looking at Catalogue of the Diptera of the Americas South of the United States (Q51386632). Where does the fact that this is a catalogue (Q2352616) come into play, exactly?
- In this case main topic is not an option: that is clearly Diptera (Q25312).
- Take a scholarly article like Book Review: The evolution of orthopaedic surgery (Q24797163). Which property should be used to indicate that this is a book review (Q637866)?
- Not "main topic" as is currently on the item. The main topic is clearly the book being reviewed (there is no item for the book as far as I can tell)!
- What about other more specific types of scholarly articles like case report (Q2782326), letter to the editor (Q651270), scholarly letter/reply (Q110716513) (and yes these are different: not all scholarly letters in journals are replies) or review article (Q7318358)?
Circeus (talk) 18:50, 12 February 2025 (UTC)
- A book review, letter to the editor, or case report would all be a genre (P136). The genre is determined by a work's literary style and content. The form of a creative work is determined by its length or other physical properties. A letter to the editor and a scholarly letter are both letters, so letter is the form of the work. Letters are short written works addressed to an individual or group. --EncycloPetey (talk) 13:09, 13 February 2025 (UTC)
- Turns out the property I wanted was not even genre. It was publication type of scholarly work (P13046)! Circeus (talk) 18:35, 26 February 2025 (UTC)
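With that property, the classification becomes directly queryable; for instance, this untested sketch lists articles typed as book reviews:

```sparql
SELECT ?article ?articleLabel WHERE {
  ?article wdt:P13046 wd:Q637866 .  #publication type: book review
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
```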
Importing data from OpenLibrary, Goodreads & Co
I noticed many books don't yet have Wikidata items, including very popular and/or impactful ones. I think there needs to be some bulk import from existing databases; otherwise nobody will use Wikidata, relying instead on those other databases, and Wikidata's content will remain very gappy.
- First, some script / bot that adds data from OpenLibrary or Goodreads to the matching Wikidata item. The matching could be done via the ISBN, the Goodreads work ID (P8383), or the title (English, or the title that matches the language of work or name (P407)). I think most books we'd want to have data on already have data on Goodreads/OL, or that would be the first place to get some info from; there are also some other databases and publisher sites like Barnes & Noble. I don't know if ISBNs are fine to set on items that are instances of literary work, but if not, some bots should probably move them to any new items; one can't expect people to spend hours just to add a handful of the millions of books by hand, of course, and again, whatever imports have been done, they seem to miss out on most books, even when considering only the most-read ones.
- Once there are mechanisms by which book items in Wikidata get their data populated up to some reasonable state, including importing all the ISBNs for the book, importing of books is needed to extend which books we have data on. Here one would obviously also have to check whether the book item already exists, and could probably use the OpenLibrary & Goodreads APIs instead of scraping. I don't know how exactly Anna's Archive got its data, but it has data on many books, and maybe the same could be used. It seems like that small team project has more book data than the large Wikimedia project contributed to by many hundreds; I wonder how that can be, and what the purpose of Wikidata is when it doesn't achieve superiority or coequality in any field of data that is actually used by people (books, films, music, food). One could also use the Anna's Archive metadata dump and import data from that.
Is there any effort going on to mass import books data and if not could somebody please step up to that task? Prototyperspective (talk) 01:35, 25 February 2025 (UTC)
- Please do not use the Anna's Archive metadata dump to populate Wikidata. The IP issues involved are potentially very significant and could even put Wikidata itself at risk, see Wikidata:Requests_for_comment/Anna's_Archive#IP_concerns_may_be_quite_severe_here for a description of the rather obvious concerns with this. OpenLibrary's CC0 release appears to be entirely unproblematic by comparison, so we should just focus on that instead. Hupaleju (talk) 20:53, 28 February 2025 (UTC)
- Disagree. It's just metadata and it doesn't matter whether it comes from Anna's Archive, Goodreads, or OpenLibrary albeit it's certainly preferable as much as possible is retrieved from Open Library. It's just factual data, e.g. you can't copyright the information of the number of pages a book has. In any case, people can just import the data without announcing the details of how they do it. Prototyperspective (talk) 00:56, 1 March 2025 (UTC)
- The Goodreads and OpenLibrary data are jumbled garbage for the majority of works I've investigated, with incorrect dates, incorrect information about editions, and patent nonsense. Mass importation of garbage is a bad idea. And, no, ISBN values should never be set for a literary work; they are specific to editions of works. --EncycloPetey (talk) 01:59, 1 March 2025 (UTC)
- Thanks for contributing your insights into these – the question you skipped there would be whether the data can be cleaned. And also whether it's better to have lots of mostly correct, correctable data, or just a random (i.e. not the most notable) 1% of books in the database. The data in Wikidata itself is also just mostly correct and has lots of garbage in it, in your terms; there are many items that were initially correct but were edited to introduce false info, or had false data to begin with.
- Furthermore, every time I checked these sites for data, all of it was correct; as for the publication data – just import the year but not the month and day, and that should be fine. I don't know which data of which sample subset you looked at.
- I know ISBNs should not be set for literary works as things are currently, but if there are problems with how things are currently, then one should consider changing ways. In particular, one could consider setting many ISBNs on the item and specifying details via qualifiers, like language and year. One could still have separate items for editions of very notable works, but for normal books, what's the actual need and value in having separate items for these? There may be substantial value in it, but not as much as the value added by finally making Wikidata useful for books by making it have items on most books. Prototyperspective (talk) 02:15, 1 March 2025 (UTC)
- I would like to second @EncycloPetey's perspective. The data in Goodreads and OpenLibrary is not just problematic because it contains a lot of errors - though it does - but also, for example, because it contains an enormous amount of duplicate entries. Directly importing this data into Wikidata would create huge cleanup problems. Obviously, cleanup is always possible, but currently we are struggling even to clean up the bibliographic data that is already there and continuously being added. On the IP issue, we have to consider not just copyright but also, for example, the European Database Directive, which makes complete duplication of databases problematic even when they only contain factual information. Finally, I don't see any advantage in storing information about editions in qualifiers - particularly because that would result in the same kind of information being stored in two different ways in different items, which is undesirable from the perspective of querying the database. Pfadintegral (talk) 09:10, 1 March 2025 (UTC)
- Good points I guess but again I think there would need to be thinking/discussion/work on how the data could be cleaned.
- Moreover, you and EncycloPetey claim the data is very bad but have not yet provided any data/evidence/sources backing that up; it's not my experience with these sites, and they seem to be heavily used and the best, most complete out there.
- There is no need to perfect the cleanup of data in Wikidata before importing more books, I think – on the contrary, one should first make sure the database contains data on most books, and deduplicating can then be part of a second step. In short, I'm not suggesting it's directly imported. I think one could use the title, ISBNs, and author fields to deduplicate during and after import.
- Yes, it's good and necessary to consider all that. Has somebody looked into the European Database Directive and how imports are compliant with it? I don't think it can just say that simple factual public data in json format or whatever is somehow not importable. For example, see part "by reason of the selection or arrangement of their contents". Maybe it needs some proactive inquiry.
- I don't care whether data about different editions is stored as separate items or within the item about the book. I mean, it would probably make it easier (and not as super time-intensive) to create book items manually, or to find info about them and maintain it, but I haven't brought this up because of any intrinsic advantage. It's that people shouldn't be required to create 20 items over an hour just to add 2 more books into Wikidata. It should be done by imports, and if something is done manually, much of the work should be done by bots. Additionally, elsewhere people brought up that it would result in many more items, but the number of new items would be far lower if we had ISBNs set on the item about the literary work, which could have qualifiers with the info people would otherwise put into the separate item about the edition.
- Re "the same kind of information being stored in two different ways in different items": adjusting the query so that it combines the results of both ways of storing this info – or rather, providing an example query people can readily use and adjust that does that – is far better than Wikidata not having items on most books and thus not being useful and used for book-related things like inventaire.io (Q32193244). Let me know if you want me to bring up such a query. Alternatively, one could also copy things that are currently in separate edition items into the literary work items, so all of it is stored in one way. Later, one can still consider having separate items for every single edition of every book in Wikidata and moving the data accordingly.
- Prototyperspective (talk) 15:11, 1 March 2025 (UTC)
- To be clear, the data on Goodreads and OpenLibrary is in general very valuable and useful. But since you are asking for concrete examples, let me illustrate the quality issues on OpenLibrary, choosing as a random sample the last book I happened to edit, "We Can Build You" by Philip K. Dick, and look at the immediately obvious issues there:
- The work entry https://openlibrary.org/works/OL2172520W lists Philip K. Dick and Dan John Miller as co-authors. This is wrong; Dan John Miller is merely the narrator of one audiobook version of the novel.
- A search reveals at least four other work entries describing the same work, but not linked to the one above - https://openlibrary.org/works/OL32437539W, https://openlibrary.org/works/OL26131798W, https://openlibrary.org/works/OL27360824W, https://openlibrary.org/works/OL27073446W - one of them attributed to "Penguin Books Staff" as author instead of Dick
- Looking at the individual edition items linked from the main work entry, https://openlibrary.org/books/OL31957716M and https://openlibrary.org/books/OL3665369M appear to be duplicates, as do https://openlibrary.org/books/OL7259231M and https://openlibrary.org/books/OL9213946M.
- And this is without checking page counts, publishers or publication dates for accuracy. Pfadintegral (talk) 19:02, 1 March 2025 (UTC)
- Thanks for these examples!
- Those are all OpenLibrary examples, however; I thought Goodreads was better in terms of data, even if OpenLibrary is more aligned with Wikimedia values/goals/model.
- Regarding the second point: I don't think it's of primary priority to import all editions within one language (albeit it would be good to have all ISBNs covered) – it's more important that there is one item for the work. One could import all those items and then establish the linking, or do that during the import by e.g. checking for other items with the same book title and at least one overlapping author.
- One could also clean up data via various scripts afterwards, like checking for authors with "staff" in the name. However, I think the dataset Anna's Archive uses has better metadata, where this isn't as much of an issue in the first place. Regarding the duplicate items: if they don't have separate ISBNs, as in this case, one can check page numbers, authors, year, and title to see whether it's a duplicate. Deduplication is an established standard practice, and it would be applied here too if needed, but it may not be needed with the AA dataset (I haven't looked into it much, though).
- Prototyperspective (talk) 01:44, 2 March 2025 (UTC)
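One simple duplicate check of the kind described above can already be expressed as a query; an untested sketch (assuming ISBN-13 (P212)) that finds edition items sharing an ISBN:

```sparql
SELECT ?isbn (COUNT(DISTINCT ?edition) AS ?count) WHERE {
  ?edition wdt:P31 wd:Q3331189 ;  #edition
           wdt:P212 ?isbn .       #with an ISBN-13
}
GROUP BY ?isbn
HAVING ( ?count > 1 )  #shared by more than one item
```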
- Note that the underlying issue is not restricted to the Database directive at all. Even in the U.S. and elsewhere, there are very real legal protections against the use of 'misappropriated' 'confidential' information that is of significant commercial value, even when such information is purely factual - and the Anna's Archive issue seems like one obvious case where the dataset might quite likely be construed as such. Hupaleju (talk) 16:59, 1 March 2025 (UTC)
- That data is not confidential, and misappropriation is not applicable either. The point wasn't about it being purely factual – I used that term for lack of a better one – it's about it being merely factual metadata; e.g. there is just one title and ISBN for a book edition. There may be issues that some people should look into, e.g. how to import data in a way that is fine. When I'm creating an item for a book, I open a new tab, search for the book on Goodreads or the publisher page, then copy and paste (ctrl+c & ctrl+v) information like the title and ISBN (no, I don't know them by thinking hard about it). Many add data that way, even if they don't disclose it. A machine is allowed to do the same. Moreover, people would prefer if the machine doesn't scrape but uses an API, and a human could theoretically do the same if they use some other client (e.g. let's say I have some Goodreads app and instead check there). Furthermore, instead of doing it all anew, one could simply use the aggregated data somebody has already compiled. Some volunteers more knowledgeable in these subjects need to look into it. Prototyperspective (talk) 01:32, 2 March 2025 (UTC)
New proposal for "applies to volume" property
See Wikidata:Property proposal/applies to volume. حبيشان (talk) 16:33, 12 March 2025 (UTC)