The missing catalogue: why finding books in translation is still so hard (blogs.lse.ac.uk)
AusiasTsel 4 days ago
Author here. The piece is about bibliographic infrastructure, but the finding that surprised me most while building the dataset was language-specific: Catalan/Valencian (~10M speakers) jumped from near-invisibility in commercial aggregators to 8th place globally once nine national library catalogues were cross-referenced. Bengali, Thai and Urdu —all with substantial publishing industries— remained near the bottom, not because translations don't exist but because the institutions documenting them haven't been connected yet. The 97% figure (editions appearing in only one of 14 sources) held across every sample I could run. Happy to answer questions about methodology, source coverage, or why ISBN metadata is such a mess.
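
For readers wondering how a figure like the 97% is computed, here is a minimal sketch under stated assumptions; it is not the pipeline behind the article. It assumes each catalogue record has already been reduced to a hypothetical `(edition_key, source)` pair, where `edition_key` is some agreed identifier (for example a normalized ISBN), and it reports the share of editions seen in exactly one source.

```python
from collections import defaultdict


def single_source_share(records):
    """Share of editions that appear in exactly one source.

    `records` is an iterable of (edition_key, source) pairs; both names
    are illustrative assumptions, not fields from the real dataset.
    """
    sources_by_edition = defaultdict(set)
    for edition_key, source in records:
        sources_by_edition[edition_key].add(source)
    if not sources_by_edition:
        return 0.0
    singles = sum(1 for s in sources_by_edition.values() if len(s) == 1)
    return singles / len(sources_by_edition)
```

Repeated sightings of an edition inside the same source collapse into one, so only cross-source overlap lowers the share.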
btrettel 1 day ago
Have you all considered adding scientific articles to your bibliographic database? Finding existing translations of scientific articles can be a real pain. I know because I spent a lot of time doing that during my PhD [1].

For a while I was collaborating with Victor Venema in the volunteer organization Translate Science [2] to try to create a bibliographic database of scientific translations, but unfortunately Victor died, and I became too busy to continue.

[1] https://academia.stackexchange.com/a/93209/31143

[2] https://translate-science.codeberg.page/

AusiasTsel 23 hours ago
Thanks for the link; Translate Science is exactly the kind of gap-filling project that makes sense once you see how fragmented the bibliographic layer is. Sorry to hear about Victor; I'd seen the repos but hadn't known.

Scientific translations are a different animal from what I've been working on, in ways that make them both easier and harder. Easier because scholarly communication already has a near-universal identifier (DOI) and, in principle, Crossref metadata. Harder because most translated articles never get their own DOI — they live as post-hoc PDFs on an author's site or inside an institutional repository (HAL, SciELO, J-STAGE, NII) with no machine-readable back-reference to the original, and the original's Crossref record almost never points at them. So the signal is worse than with books despite the underlying infrastructure being better.

The approach that might transfer: instead of trying to convince publishers or journals to register translations (they won't), scrape what's already sitting in institutional repositories and national scientific databases, then reconcile by author + title fingerprint + language. The multilingual matching pipeline I use for books is probably the right shape for the article problem too, though the authority side is messier there. ORCID helps; affiliations drift and make it harder.
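
A minimal sketch of what an "author + title fingerprint + language" key could look like, assuming records have already been scraped into dicts with hypothetical `author`, `title` and `language` fields. It is an illustration of the idea, not the matching pipeline described above. The diacritic stripping is only sensible for Latin-script metadata; other scripts would need their own normalization.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(text: str) -> str:
    """Casefold, strip diacritics and punctuation, sort the unique tokens."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    tokens = re.findall(r"\w+", no_marks)
    return " ".join(sorted(set(tokens)))


def match_key(author: str, title: str, language: str) -> tuple:
    return (fingerprint(author), fingerprint(title), language.lower())


def group_candidates(records):
    """Group records sharing a key; only groups with 2+ records are kept."""
    groups = defaultdict(list)
    for rec in records:
        key = match_key(rec["author"], rec["title"], rec["language"])
        groups[key].append(rec)
    return {k: v for k, v in groups.items() if len(v) > 1}
```

Because the language is part of the key, this only clusters copies of the same translation found in different repositories. Linking a translation back to its original would need a separate step, since the titles differ across languages.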

Not something I'm committing to build, but I'd be curious to see what you and Victor had assembled if any of it is still reachable. Happy to compare notes offline if useful.

btrettel 6 hours ago
Thanks for the reply. You're right that the data for this is very fragmented. Victor was looking at Crossref metadata. I think he kept his work on Codeberg, though I'm not sure. I was looking at arXiv and at printed translation indices from the 1960s to 1980s, which list paper translations that today sit uncatalogued in archives at the Library of Congress, the British Library, and other libraries and archives. (The indices list which libraries hold each translation, and in my experience what they say is accurate for the Library of Congress.) OCR was not cooperating on turning my scans of the translation indices into something I could parse, despite the indices having a regular structure indicating that they were computer-generated. LLMs likely would help with that now, but all of this was pre-ChatGPT. My plan was to automatically map the bibliographic data in the indices to DOIs, but as it turns out, a large fraction of the articles in the indices do not have DOIs. We ultimately did not consolidate these sources.

Anyhow, it's obviously a huge task and I don't expect you to build this. I was just curious if you had thought about it as you clearly have a lot of relevant infrastructure in place. If I ever get the time and interest to work on this again, I'll reach out to you.

shermantanktop 1 day ago
I deal with similar issues. Translation is sometimes thought of as a mechanical process, but it is a creative process where the translator’s approach varies from subtle to heavy-handed. At some point the translation can be thought of as a new creative work, and that line is hard to define.

One of my parents was a translator who worked directly with authors, and in the review process the author would expand or refine the text in ways that were not present in the original. At that point, which work is the true representation of the author's intent: the fixed original or the updated translation?

AusiasTsel 23 hours ago
That question is one of the reasons I ended up building this the way I did, rather than collapsing translations into a single canonical record.

The tradition your parent was in (translator working directly with a living author) produces some of the most interesting edge cases in translation studies. Kundera is the famous one: he eventually treated the French translations of his Czech novels as the authoritative versions and had the Czech editions revised to match. Borges did something similar on a smaller scale with his English translators. Beckett translated himself between French and English and the two versions don't agree. In each case, "which is the book" is genuinely undecidable on textual grounds.

The decision I made early was that the database shouldn't try to decide. Every edition gets its own record with its own metadata. If an author revised through a translator, that shows up as a later-dated edition in the original language with a different publisher or an explicit translator credit; the relationship is visible but not adjudicated.

It turns out this is also the only stance that survives contact with reality across the national library catalogues I've integrated. Each catalogue already encodes its own editorial judgment about what counts as a "work," and forcing them into a single hierarchy produces more bugs than insights. Letting the plurality stand is both philosophically honest and, as it happens, technically cheaper.
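
A minimal sketch of the "one record per edition, relationships visible but not adjudicated" stance, with every class and field name an assumption made for illustration rather than the actual schema. Editions are never merged into a canonical work; a relation only records that some source asserts a link.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Edition:
    source: str                      # catalogue the record came from
    source_id: str                   # that catalogue's own identifier
    title: str
    language: str
    year: Optional[int] = None
    publisher: Optional[str] = None
    credits: str = ""                # free-text author/translator field, kept verbatim


@dataclass(frozen=True)
class Relation:
    a: Edition
    b: Edition
    asserted_by: str                 # who claims the link; nothing is merged


def related(edition: Edition, relations: List[Relation]) -> List[Edition]:
    """Editions linked to this one, in either direction, without ranking them."""
    out = []
    for r in relations:
        if r.a == edition:
            out.append(r.b)
        elif r.b == edition:
            out.append(r.a)
    return out
```

Neither side of a `Relation` is marked as the original, so a case where the author treats the translation as authoritative needs no special handling.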

gobdovan 1 day ago
It's so interesting that there are fewer 'Le Petit Prince' versions in French (there seems to be only one) than in Chinese, where there seem to be at least 50. [0]

You could argue that there's more experimentation and creation in other languages than the original just because it's socially acceptable to do 'yet another translation', but not a newer version in the same language (unless it's a manual or technical material).

[0] https://www.cjvlang.com/petitprince

mysterypie 1 day ago
> it's socially acceptable to do 'yet another translation', but not a newer version in the same language

I wish they'd teach with modern English translations of Shakespeare in high schools. Maybe then kids would like it a lot more. But it seems like it's taboo to read Shakespeare in anything but the original.

lamasery 1 day ago
They do. One series often used is "No Fear Shakespeare". Facing-page "translation", relatively cheap.

It's much better to watch it performed, though. The context the actors provide gets one past much of the difficulty with vocabulary or what have you. But yeah they do insist on reading them in school.

> But it seems like it's taboo to read Shakespeare in anything but the original.

You're definitely losing most of the sublimity in his actual words, if you don't read the original. Especially if the "translation" is into English at e.g. a 9th-grade reading level.

In the case of Shakespeare in particular (and also certain archaic translations of the Bible, notably the King James) modernizing/simplifying it may alter the language enough that the reader may not recognize unacknowledged (because of course your reader will know their Shakespeare) quotes from his works in other works, which quotes are everywhere even in things like modern popular cinema or TV. A big part of why you read Shakespeare to begin with is that his influence is so extensive that you practically have to, or you'll be missing one of a very-few not just helpful, but nigh-necessary, keys to understanding the rest of English literature (broadly, to include things like movies and video games and TV and so on)

AusiasTsel 23 hours ago
You're right that version and edition aren't the same thing, and the catalogues I'm working with don't model "translation" as a first-class field — translator credits live in free-text author fields and are wildly inconsistent across national libraries. The cleanest proxy I can offer is distinct publishers per language, read alongside the edition count. For Le Petit Prince, top languages by edition count:

  Language    Publishers   Editions   Ed/Pub
  English         518        1,245      2.4
  Spanish         416        1,055      2.5
  Japanese        204          965      4.7
  French          312          928      3.0   (original)
  German          199          666      3.3
  Italian         184          641      3.5
  Chinese         233          361      1.5
  ...
  Hebrew            3          138     46.0

Two caveats are visible in the table. Publisher names aren't normalized across catalogues, so high counts in big markets (English, French) are inflated by imprint variants of a single house: Gallimard, Gallimard Jeunesse, Éditions Gallimard and Folio all show up as distinct. At the other extreme, Hebrew with 3 publishers on 138 editions is the proxy's other failure mode: one or two canonical translations reprinted repeatedly. So the number is directional, not absolute.
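
As a rough sketch, the Ed/Pub column and the imprint caveat can be reproduced like this. The publisher and edition counts are copied from the table above; the alias table is a hypothetical hand-made stand-in covering only the four Gallimard variants named in the caveat, not a real authority file.

```python
import unicodedata

# (publishers, editions) copied from the table above.
ROWS = {
    "English": (518, 1245),
    "Spanish": (416, 1055),
    "Japanese": (204, 965),
    "French": (312, 928),
    "German": (199, 666),
    "Italian": (184, 641),
    "Chinese": (233, 361),
    "Hebrew": (3, 138),
}


def editions_per_publisher(publishers: int, editions: int) -> float:
    return round(editions / publishers, 1)


def _norm(name: str) -> str:
    """Unicode-normalize, casefold and collapse whitespace."""
    return " ".join(unicodedata.normalize("NFC", name).casefold().split())


# Hypothetical alias table; a real one would be far longer.
_ALIASES = {
    _norm(variant): "gallimard"
    for variant in ("Gallimard Jeunesse", "Éditions Gallimard", "Folio")
}


def canonical_publisher(name: str) -> str:
    key = _norm(name)
    return _ALIASES.get(key, key)


def distinct_publishers(names) -> int:
    return len({canonical_publisher(n) for n in names})
```

With aliases applied, the four Gallimard variants count as one publisher, which shrinks the denominator and raises Ed/Pub for that language. The Hebrew failure mode is untouched by this fix, since there the low publisher count is real.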

The Chinese row is the cjvlang pattern in distilled form: 233 distinct publishers with an edition-to-publisher ratio of 1.5 means most Chinese publishers hold their own translation and reprint it only a handful of times before being displaced. That's consistent with — and probably a conservative reading of — cjvlang's "at least 50 versions" figure.

One extra wrinkle worth flagging: "Chinese" in that row isn't one language. National library catalogues collapse at least five Sinitic languages — Mandarin, Cantonese, Wu, Min Nan, Hakka — under a single "zh" tag. Wikidata records separate Petit Prince translations in Cantonese, Wu, Hakka, and Min Nan, each with its own transliterated title ("Séu-Vòng-Chṳ́" in Hakka, "Sió Ông-chú" in Min Nan), but no national catalogue I pull from surfaces them as distinct. The same kind of collapse applies to Arabic, where "ar" hides Modern Standard plus several regional varieties that have their own literary traditions. So the 361 Chinese figure is already aggregating over a hidden second axis of variation.
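
To make the collapse concrete, here is a small sketch using ISO 639-3 codes for the five Sinitic languages named above (cmn, yue, wuu, nan, hak) and for Standard Arabic (arb). The mapping is deliberately partial and the function name is hypothetical; it only illustrates how per-language records disappear into a single catalogue tag.

```python
from collections import Counter

# Partial, illustrative mapping from a catalogue tag to ISO 639-3 codes
# it can hide. The regional Arabic varieties are not enumerated here.
MACRO_MEMBERS = {
    "zh": {"cmn", "yue", "wuu", "nan", "hak"},
    "ar": {"arb"},
}


def catalogue_tag(iso639_3: str) -> str:
    """Return the coarse tag a catalogue would file this language under."""
    for tag, members in MACRO_MEMBERS.items():
        if iso639_3 in members:
            return tag
    return iso639_3


def count_by_tag(codes):
    return Counter(catalogue_tag(code) for code in codes)
```

Calling `count_by_tag(["cmn", "yue", "hak", "nan", "wuu", "fra"])` yields five records under "zh" and one under "fra": the second axis of variation is gone after tagging and cannot be recovered from the count alone.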

Japanese tells a different story: slightly fewer publishers (204) but almost five editions each, suggesting fewer distinct translations reprinted more widely. And the French baseline is dominated by one rights holder (Gallimard family), which is what you'd expect from an original-language market with a single canonical publisher.

Retranslation within the source language is gated by copyright (Berne + 70 years post-mortem is a hard wall for most 20th-century work), the industry's default assumption that one canonical edition per language is enough, and reader expectation of fidelity when the original is in your own language. Saint-Exupéry entered public domain in France in 2015 and the French retranslation flow didn't materially open up — which I read as the publisher-economics side of your point dominating over the legal side. Retranslation into foreign languages has none of those brakes: every generation can argue its predecessor's Chinese / Japanese / Korean Petit Prince is dated or was done from English rather than French (often true), and a new translation is a lower-risk bet than trying to displace a domestic novel.

Shakespeare is the visible English counterexample: "no-fear" modernizations, facing-page editions, precisely because the original has drifted far enough from contemporary English to be partly opaque. The Bible is the other obvious case. So "retranslation-within-language is taboo" breaks down once the time distance gets large enough — roughly when the original stops being read without friction.

tjirrkkkk 1 day ago
Getting a proper ISBN is a lot of unpaid, expensive work. If you do small print runs, you may end up sending something like 10% of all your copies to libraries at your own expense. Putting an unregistered PDF on the web is free...