Large-scale books metadata imports?

When checking whether a book has an item in Wikidata, most of the time I found it to be missing. Are there no large-scale imports of available book sources? I have not checked in depth; could you please post some links about what's already there? It would also be good to have this info directly on the WikiProject page.

I think books are a kind of item that really shouldn't be added by hand but rather imported at large scale: manual entry ties up valuable volunteer time, and quite large book databases already exist.

There is the OpenLibrary data, ISBNdb.com (maybe they'd make it free on request), and BookBrainz (downloadable). Some further ways are here, many other book sources with potential APIs or dumps are listed here, and since recently one could also use Anna's Archive. For the latter, it seems one could download Anna's Archive's ISBNdb scrape JSON dump, which currently seems to be the most efficient approach.

Tools would check if the item already exists, and if it does update it with any additional data; any data conflicts should also be tracked so people can set whatever is better or both (either via semi-automatic edits or by changing the import scripts or properties so that it can add both).
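The update step described above could be sketched roughly as follows. This is only an illustration, not an existing import tool: the function name `merge_record`, the `MergeResult` type, and the flat dictionary record shape are all assumptions made for the example. It fills gaps in an existing record from an incoming one, never overwrites, and logs every disagreement with its source so a person can decide later.

```python
from dataclasses import dataclass, field


@dataclass
class MergeResult:
    merged: dict                                    # existing record with gaps filled in
    added: list = field(default_factory=list)       # keys that were newly filled
    conflicts: list = field(default_factory=list)   # (key, old, new, source) tuples


def merge_record(existing: dict, incoming: dict, source: str) -> MergeResult:
    """Fill gaps in `existing` from `incoming`; never overwrite, log conflicts."""
    merged = dict(existing)  # work on a copy so the caller's record is untouched
    result = MergeResult(merged=merged)
    for key, new_value in incoming.items():
        if new_value in (None, ""):
            continue  # the incoming record has nothing to offer for this field
        old_value = merged.get(key)
        if old_value in (None, ""):
            merged[key] = new_value
            result.added.append(key)
        elif old_value != new_value:
            # Keep the existing value and record the disagreement for review.
            result.conflicts.append((key, old_value, new_value, source))
    return result
```

The conflict list is what would feed a review page or a semi-automatic edit tool; adding both values with a reference each is a policy decision that sits outside this sketch.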

I kind of thought most books in online databases would already be in Wikidata, but apparently they aren't. The imports of scientific studies are even less extensive; things like Wikidata:Scholia could start to be useful (e.g. through AI-set "main subjects" data and statistical charts) if maybe 40% of all studies or 60% of cited/notable ones were included, but it seems that currently not even 5% of all have been integrated. Prototyperspective (talk) 12:39, 20 February 2024 (UTC)

@Prototyperspective: Yes, it is strange and interesting. Currently, "only" ~70,000 books and ~500,000 versions/editions/translations are available in Wikidata. According to Google, there are over 120,000,000 different book editions. I think we should have an item for each (here in this WikiProject the written work/edition concepts are preferred; I am fine with that, it is a good solution). But if that is too much, we could start by creating items for all written works and editions by any author who has an item, and import any written work/edition by an author when he/she is added to Wikidata, gradually improving our coverage. I am trying to do that with CC-0 metadata from the National Library of Spain (see this). Emijrp (talk) 21:37, 21 February 2024 (UTC)
There are HUGE numbers of books not here yet, but part of the problem is the inconsistencies in various databases, leading to duplicate listings that have to be manually identified and corrected, as well as incorrectly merged items in those other databases that get mismatched with existing items here. There is also inconsistency in major library databases, such as the Library of Congress in the US. There have also been attempts to import data from Wikipedias, but these have resulted in a mishmash of data that then has to be manually cleaned up, such as an ISBN on a 19th-century publication. But you are correct, there are many, many books not yet present in Wikidata. --EncycloPetey (talk) 22:40, 21 February 2024 (UTC)
Well, that is a question. For example, the Czech National Library, which has a bibliography of about 7 million entries, was blocking its release even though it should have been in the public domain for ages. Recently I heard that they finally released it. So the reason may be that bibliography holders elsewhere are blocking release as well.
Regarding manual adding: I am doing that. I am doing it because of the point above, since we were not sure whether we would ever be able to import the Czech bibliography, so I would not prohibit it. Or, better to say, why can't contributors do what they want? Juandev (talk) 22:44, 25 February 2024 (UTC)

Personally I don't think it's particularly useful to do mass imports of book metadata into Wikidata right now (though eventually we should, let's say within the next 20 years). If you're interested in mass imports, it's probably best to start importing into OpenLibrary, where the data is actively used and maintained. There are still big gaps there, as the bulk imports focused on some countries. If/when Wikidata starts having a use case for metadata about millions of books, it will be easier to import from OpenLibrary. Nemo 07:54, 23 February 2024 (UTC)

If Wikidata does not even have a comprehensive dataset on books what exactly is it or will it be good for? Books-metadata is one of the first things that comes to mind where an open freely-accessible structured dataset would be useful (once it's as comprehensive as alternatives).
  • In addition, if volunteers enter the data by hand because data on books is missing, that uses up valuable contributor time. Fixing this issue thus improves the state of open knowledge overall.
  • The use cases include scientific research, visualization, Scholia, integrating things like a module for "Most-cited, most-popular, most-relevant books about this topic" for Wikipedia articles, archive completeness evaluations, upcoming AI-scientist software agents, structured-data based search engines, open source apps like ebook readers, and more. All of this would only be possible once the dataset becomes more complete.
  • The same also applies to other content like podcasts, studies, foods, and software: the data only becomes valuable once it becomes reasonably complete. This is about books since for these there are already readily available datasets to integrate centrally here.
One issue is that the open nature of Wikidata means that it would be near impossible to make sure people don't add vandalizing/false data into items when that many items exist. Thus, non-bot edits to these items should somehow be tracked separately so they can be checked and maybe also other measures such as semi-locking these items to only bot- and reviewed-edits. I don't think Wikidata should aim to be anything but the most comprehensive open structured data repository of the world. If it doesn't even contain books metadata it falls short of even the basics.
Concerning OpenLibrary that I also linked above: isn't that far less comprehensive than the other datasets I linked such as the ISBNdb json dump? One can bulk-revise items after bulk importing all so it's not a one-time thing.
@Emijrp: Interesting! But why don't you put the estimated number of known/not-totally-insignificant books there instead of only saying that the "Total number of books […] is unknown"? That 120 M number for example would be better than none and there can be multiple estimates each with source. Thanks also for the info about the bot, I was partly looking to see which bulk import efforts are currently being done.
@EncycloPetey: Then it seems like the code for identifying duplicate items needs to be improved. However, I think it would be difficult to misidentify books as the same when they have ISBN IDs and most items seem to have these. There could be reports of mergers / likely duplicates to review where people only need to click a button to merge or correct misidentified separate items. It would be a much better situation if all books without such issues were imported and all items with such issues were on hold and on a list of items with inconsistency issues. I think in most of the latter cases, which could be worked on at a later point, the solution could usually be as simple as importing the data from both databases with a reference to the database so we have both and/or can pick what is better. This wouldn't mean a mishmash but having the data of both so data-users can simply choose which data to pick. Prototyperspective (talk) 13:13, 23 February 2024 (UTC)
No, ISBNs do not exist on "most books". In fact the majority of books present on the Wikisource projects do not have ISBNs because those books predate the invention of ISBNs. Believing that ISBNs will somehow solve problems is a naive approach. Some editions have multiple ISBNs associated with them, and many, many books have no ISBN associated with them. --EncycloPetey (talk) 17:39, 23 February 2024 (UTC)
I was referring to most items in the data dumps, especially the json one.
Wikisource probably has mostly very old books. If "most items" (you misquoted me btw) in the dumps indeed don't have an ISBN, then one could at least import those that do. I was just asking what's being done with regard to book imports and what the difficulties are. If ISBNs are not a good way for a substantial fraction of importable items, then I guess one could use book title + author. I wasn't coming to this claiming to have a fully fleshed-out approach that I'm proposing or anything like that. Prototyperspective (talk) 18:43, 23 February 2024 (UTC)
Also, are you aware of the difference between a work and an edition? ISBNs are for editions, and there is no easy way to automate connecting editions to their work data item. I've seen other folks do automated imports which went very wrong in this regard. --EncycloPetey (talk) 20:57, 23 February 2024 (UTC)
Well the ISBN issue complicates things. But I think that is more about how this would be implemented and potential difficulties. Maybe the items are already connected to their work items in the datasets so that could be used to some extent. And if not, one could check the title, author, and publication date fields to connect things or to only import a book once, not once per edition in the dataset.
For things that are a bit unclear (like a slightly different title), these could be written to some 'bot suggestions to review' queue. I don't know how much benefit/value there is in having data on many editions, but I do think having as many published books in the database as are importable would be useful, and that having associated ISBNs with their respective language linked in it would be too (so that one can e.g. enter an ISBN and directly get data on the book). Prototyperspective (talk) 18:46, 17 July 2024 (UTC)
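The title + author matching with a review queue suggested above could look roughly like this. It is a minimal sketch under stated assumptions: the record shape (dictionaries with "title" and "author" keys), the function names `normalize` and `classify`, and the 0.85 threshold are all made up for illustration, and publication date is left out to keep it short. It uses only the Python standard library (`unicodedata`, `difflib.SequenceMatcher`).

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    kept = "".join(c if c.isalnum() or c.isspace() else " " for c in no_accents.lower())
    return " ".join(kept.split())


def classify(edition: dict, works: list, review_threshold: float = 0.85):
    """Return ("match", work), ("review", work) or ("new", None)."""
    title = normalize(edition["title"])
    author = normalize(edition["author"])
    best, best_score = None, 0.0
    for work in works:
        if normalize(work["author"]) != author:
            continue  # only compare titles for the same (normalized) author
        score = SequenceMatcher(None, title, normalize(work["title"])).ratio()
        if score > best_score:
            best, best_score = work, score
    if best is not None and best_score == 1.0:
        return ("match", best)    # identical after normalization
    if best is not None and best_score >= review_threshold:
        return ("review", best)   # close but not identical: send to a human queue
    return ("new", None)
```

Only exact matches after normalization are linked automatically; anything merely similar goes to the review queue, which errs on the side of the caution EncycloPetey asks for. Exact author-string matching is itself a simplification, since real datasets spell author names inconsistently.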

What about audiobook?

In my understanding, audiobook (Q106833) is a distribution format (P437) of a version, edition or translation (Q3331189). But I don't see any information about this on this page, and real usage in Wikidata is different. Am I right? Skim (talk) 21:03, 24 February 2024 (UTC)

I think these are entered in the audio field when an audio is available. I guess multi-part audiobooks should be converted to one file. I have added a few audiobooks from these and there are also many Librivox audiobooks. See Commons:Category:Audiobooks by language. It would be nice if a script / bot were made that added the files we have on WMC and maybe also those hosted on the Internet Archive. Then maybe audiobooks could be added as a feature to WikiVibes and be auto-displayed on the associated Wikipedia page. You are probably asking mainly about something different, though I am not sure what exactly: perhaps whether "audiobook" is widely set as distribution format when an audiobook is available. If there is no script / bot that checks audiobook sites like Audible and adds them to distribution format, that would be valuable in the sense of data completion, but I think the more important task would be making the data useful or doing things that are already useful rather than just completing data for the sake of completeness. Prototyperspective (talk) 11:04, 11 June 2024 (UTC)

Version or book for P31

Recently, well, about a year ago, Wikidata started to push me not to use Book or Journal as a value for P31, but Q3331189 (version) instead. OK, but where do I indicate that it is a book or journal then? Moreover, if P31 equals Q3331189, services like Zotero have a problem mapping it and creating a Zotero citation out of it. Juandev (talk) 22:39, 25 February 2024 (UTC)

For journals you have Wikidata:WikiProject_Periodicals. Skim (talk) 21:32, 27 February 2024 (UTC)
I use the distribution format (P437) property (usually as a qualifier to ISBN) with the ebook (Q128093), printed book (Q11396303), hardback (Q193955), softcover (Q990683) and paperback (Q193934) values for books. D6194c-1cc (talk) 05:33, 6 March 2024 (UTC)
I see, thx. Juandev (talk) 06:41, 23 July 2024 (UTC)
Also, take a look at book edition (Q57933693). D6194c-1cc (talk) 07:46, 8 March 2024 (UTC)

Notability of vanity press

Hi,

With Fralambert, we are wondering whether vanity press (self-published) books are notable enough for Wikidata. For instance, Projets personnels et gestion du temps, pistes d'organisation (Q62648172)/Projets personnels et gestion du temps, pistes d'organisation (Q62662230) (especially in this case, where the items have no sitelinks and are only linked to each other, so they clearly fail the 1st and 3rd criteria of WD:N, but what about the 2nd?).

Cheers, VIGNERON (talk) 15:45, 18 May 2024 (UTC)

@VIGNERON I would only have a general comment on this. An item should only be created if the book is listed in another database where the author cannot insert it himself. I don't know if Wikidata has a list of, let's say, less suitable databases, where the author can insert it himself. For example, anyone can buy an ISBN, so that's probably also an example of an identifier that clearly doesn't by itself make an item suitable.
Well, the second thing is that we could make an auxiliary rule. That is, when we are unable to assess inclusion in neutral databases, we look for other ways to predict the appropriateness or inappropriateness of inclusion. But right now I can't think of anything, especially since Wikidata is not Wikipedia. Juandev (talk) 06:29, 23 July 2024 (UTC)
But my answer is more about self-publications in general. Above I can see a link to a discussion on the topic. Juandev (talk) 06:32, 23 July 2024 (UTC)

Format for Adding Indexers on Book Items

I wanted some advice from WikiProject Books on which format would be more suitable to add/credit 'Indexers' in published works (before I try for a full property proposal):

Index by (Indexer/Indexed by)
  Chitra Karunanayake

Or,

editor
  Mei Yen Chua
subject has role indexer
subject named as contributing editor
  • The first option is harder to find direct references for; however, they do exist, and it would be better overall, as Wikidata could add more value to these book items and to the people who index.
  • The second option has references, including Amazon, where indexers are credited as 'contributing editor'.
  • There are also many database indexers, so if a property is created, it would be useful for databases as well.

Or, are there other alternatives? Would anyone be interested in combining catalogers and indexers as one property? What issues do you feel could arise with these options or a future property proposal? Wallacegromit1 (talk) 11:46, 2 June 2024 (UTC)

Labels for edition

Hi y'all,

Help:Default values for labels and aliases is moving along. I would like to add version, edition or translation (Q3331189) as an example where the "mul" label should be used. What do you think?

Cheers, VIGNERON (talk) 13:15, 3 July 2024 (UTC)

Seems reasonable, but not sure if we need to distinguish between version, edition or translation (Q3331189) and scholarly article (Q13442814) under "Titles". --Jahl de Vautban (talk) 13:26, 3 July 2024 (UTC)
@Jahl de Vautban: true, for people who know bibliography it may seem obvious and a bit redundant; but for the sake of clarity, I would prefer to be explicit and list both (conversely, written work (Q47461344) may seem similar, "it's also a book", but is not the same case). Cheers, VIGNERON (talk) 08:36, 5 July 2024 (UTC)
The default labels are here now! They should be fully activated on August 12th. Any objections to implementing this on items about editions? At the least, we should think together about how exactly to add the "mul" labels. See what I did on Le chevalier de Saint-Georges (Q23570001) or La Bretagne des mégalithes (Q30721976) for simple examples. Pinging top-contributors @Sic19, Epìdosis, Akbarali, Hsarrazin, Salgo60, Jarekt:. Cheers, VIGNERON (talk) 12:56, 31 July 2024 (UTC)
I don't have objections, the label for an edition item should just be the same in all languages IMHO. Epìdosis 13:06, 31 July 2024 (UTC)
Are you using "book" in the default settings? We've advocated against using book in previous situations. "Book" means too many possible things. --EncycloPetey (talk) 16:04, 1 August 2024 (UTC)
@EncycloPetey: I don't understand, I'm talking about version, edition or translation (Q3331189). Also, I don't use "book"; in fact, I regularly mass-remove the instance of (P31) book (Q571). Cheers, VIGNERON (talk) 13:28, 28 August 2024 (UTC)
You pointed to Le chevalier de Saint-Georges (Q23570001) as an example, where the description has book (en); Buch (de); and libro (es) in the description fields. That's not a good example because it uses "book". --EncycloPetey (talk) 13:32, 28 August 2024 (UTC)
@EncycloPetey: I see, I removed the wrong descriptions on this item, thanks for letting me know. Cheers, VIGNERON (talk) 13:48, 28 August 2024 (UTC)
Hi @VIGNERON - that's really great news :)
Of course "Yes !!" for "edition" items, where the label should always be the title of the edition... even with a transcription for some languages in the description ;)
but I couldn't find how to edit this "mul" language... The usual shortcut I use for Labels (L) [see gadgets] does not seem to work any more :( - and labelLister does not recognize "mul" as a language (yet!) - the little pen only opens "french" (for me)... :( - HOW do you edit this "default" language... ?
and do you think it could somehow be put at the top of the language box (or at the bottom)? For now, finding it in a long list (like on Victor Hugo (Q535)) is rather tedious :D Hsarrazin (talk) 08:26, 28 August 2024 (UTC)
@Hsarrazin: transcription is actually a good question. I'm not exactly sure what to do there.
The default label is still not fully activated; indeed, some tools still don't understand "mul". For LabelLister, since there will be no need to copy-paste labels anymore, I guess there is no need to use it. Nonetheless, it has been reported on MediaWiki_talk:Gadget-labelLister.js#Issues_with_mul_support (not sure if someone is still maintaining it...).
Yes, having the choice to put "mul" on top would be nice; IIRC, it has already been asked for somewhere.
Glad to see that most people support it, I'll write another message below about the next possible steps.
Cheers, VIGNERON (talk) 13:28, 28 August 2024 (UTC)

Property Proposal: indexer

Hi All,

Kindly requesting your input as Support or Oppose for the following property proposal for 'indexer' at https://www.wikidata.org/wiki/Wikidata:Property_proposal/indexer

You may also Comment to improve or critique the proposal. I appreciate any constructive feedback. Wallacegromit1 (talk) 07:36, 8 July 2024 (UTC)

Logic

I am afraid I don't understand the logic behind the books. If I have a book that was published for the first time, I set instance of (P31) to version, edition or translation (Q3331189). Then, if I want to add another value like book (Q571) or paperback (Q193934), an exclamation mark appears and proposes, for example, written work (Q47461344). But how do I indicate that it's a book? written work (Q47461344) could be anything, I guess. But if we write mapping scripts to use Wikidata items as citations, it's important to know which items are books, as Wikipedias have different templates for books and for web-based texts, for example. Juandev (talk) 06:20, 23 July 2024 (UTC)

I am sorry, I hadn't noticed that you had already answered that question above. Juandev (talk) 06:38, 23 July 2024 (UTC)

Bibliographic records on Wikidata at Wikimania 2024

Hey, on Thursday, August 8 I'll be talking about bibliographic records on Wikidata in a lightning talk called "Imagine a world in which every citation is generated from Wikidata". It is part of the first lightning talk session, starting at 5.30 pm local time (which is probably CET). The talk will be available online too. Juandev (talk) 16:53, 5 August 2024 (UTC)

Sounds interesting. Why is it not on Wikimedia Commons? Note that for every citation to use Wikidata or be converted to it, one would probably need better measures to prevent unchecked/covert malicious & problematic edits to a subset of Wikidata items, as well as large-scale importing of studies and books (see the thread above). Prototyperspective (talk) 21:12, 28 August 2024 (UTC)
@Prototyperspective: it should be on Commons soon (but there are 200+ hours of video to process, please be patient). Meanwhile you can see it on YouTube: https://www.youtube.com/live/wYc5gnZfnpU?si=ulAx8JrqOVaaCxzZ&t=27587 Cheers, VIGNERON (talk) 07:22, 29 August 2024 (UTC)

Next steps for the default label

See #Labels_for_edition for the previous related message.

@Sic19, Epìdosis, Akbarali, Hsarrazin, Salgo60, Jarekt: @Fnielsen, Mfchris84, Jane023, MartinPoulter, Jahl de Vautban, EncycloPetey:

The default label is still not activated, but I think we should prepare. My suggestion is to add "mul" labels in Latin script on editions only, and not to remove any labels yet.

For some context, right now we have 540133 items with instance of (P31) version, edition or translation (Q3331189) (https://w.wiki/B2yo), 398186 of them with at least one title (https://w.wiki/B2$2), and among those 6793 with more than one title (https://w.wiki/B2zE; a bit of everything: multilingual editions, original and transcription, simple mistakes, etc.).

I suggest to copy the title as the mul label for these items :

SELECT ?q (SAMPLE(?title) AS ?title) (COUNT(?title) AS ?count) WHERE {
  ?q wdt:P31 wd:Q3331189 ;            #edition
     wdt:P1476 ?title ;               #with a title
     rdfs:label ?title .              #with a label strictly identical to the title
  FILTER ( REGEX(?title, "^[A-Z]") )  #the title starts with an uppercase Latin letter (A-Z)
}
GROUP BY ?q
HAVING ( ?count = 1 )                 #with only one title

Currently, this query gives 227182 results, a bit less than half of all the editions we have. It may be a bit too restrictive, but I prefer to be cautious (and we can still fix errors before using this query), at least for the first batch of imports. Do you see anything in the query that needs changing or improving? Also, any preference on how to add them?
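For checking a small batch offline before any import, the selection logic of the query above can be mirrored in a few lines of Python. This is a sketch for illustration only: the function name `mul_label_candidate` and the record shape (titles as distinct `(language, text)` pairs, labels as a `language -> text` mapping) are assumptions, not part of any existing tool. Like the query, it keeps an item only when exactly one title is identical to a label in the same language and starts with an uppercase ASCII letter.

```python
import re

# Same test as REGEX(?title, "^[A-Z]") in the query: uppercase ASCII letters only.
LATIN_START = re.compile(r"^[A-Z]")


def mul_label_candidate(titles, labels):
    """Return the title to copy into the "mul" label, or None if the item is skipped.

    titles: iterable of distinct (language_code, text) pairs, standing in for P1476 values
    labels: dict mapping language_code -> label text
    """
    matching = [
        text
        for lang, text in titles
        # A title only counts if a label in the same language is strictly identical.
        if labels.get(lang) == text and LATIN_START.search(text)
    ]
    # Mirrors HAVING (?count = 1): exactly one qualifying title.
    return matching[0] if len(matching) == 1 else None
```

One detail this makes visible: an item with two titles still qualifies if only one of them has an identical label, so "with only one title" in the query comment is slightly stricter than what the query actually selects. Titles starting with an accented capital such as "É" are skipped, which fits the cautious first batch.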

At a later date (at least once the "mul" system is activated for everyone), we could remove the duplicate labels to leave only the "mul" label.

What do you think?

Cheers, VIGNERON (talk) 15:07, 28 August 2024 (UTC)

IMHO the query is OK and I see no issue in adding "mul" labels to its results on the basis of your reasoning. Epìdosis 15:15, 28 August 2024 (UTC)
Yes this is probably useful for many paintings and other works too.  – The preceding unsigned comment was added by Jane023 (talk • contribs).
@Jane023: I had not thought about paintings, but they could be a good class of items for "mul"; the dedicated WikiProject should also think about it. Cheers, VIGNERON (talk) 17:31, 2 September 2024 (UTC)
The problem with titles of older paintings is: which title is the best? The one in use by the museum (which could be in a language not using Latin script) or the one used by the most highly regarded art historian? Maybe just start with use cases for paintings that have Latin script titles, and then analyze what is left over to find a better approach. Jane023 (talk) 07:20, 3 September 2024 (UTC)
@Jane023: true, then paintings are not a good class for "mul" (which is not surprising, as they are closer to works than to editions).
Anyway, should we think about how to move forward for editions? Maybe we could start with a small batch as a test? Like 100 items, linked to different Wikisources for instance, to check that there is no problem with templates re-using Wikidata?
Cheers, VIGNERON (talk) 14:15, 9 September 2024 (UTC)
Sounds good to me! Jane023 (talk) 14:26, 9 September 2024 (UTC)

Within inventaire.io (Q32193244), we have been using collection (P195) to link instances of version, edition or translation (Q3331189) to instances of editorial collection (Q20655472), and as editions can now be transferred from Inventaire to Wikidata, those statements are starting to appear here too (see example). But it has been suggested that we should rather use part of the series (P179); any opinions? I would think that if work series and edition collections used the same property, that would make it even harder to split Wikidata items that are both a work and an edition. Maybe we should create a dedicated property that could then have P1629=Q20655472? -- Maxlath (talk) 12:23, 29 October 2024 (UTC)
