What Library Digitization Leaves Out: Predicting the Availability of Digital Surrogates of English Novels
Allen Riddell, Troy J. Bassett

TL;DR
This study examines which 19th-century English novels have digital surrogates and finds that availability is not random: novels written by men and novels published in multivolume format are more likely to have been digitized.
Contribution
It shows that the availability of digital surrogates is not random but varies with author gender and publication format, which calls into question the use of digital library holdings as a proxy for the population of published novels.
Findings
Digitization favors novels by men over novels by women.
Multivolume novels are more likely to be digitized.
Biases in digitization likely extend to other genres and periods.
Abstract
Library digitization has made more than a hundred thousand 19th-century English-language books available to the public. Do the books which have been digitized reflect the population of published books? An affirmative answer would allow book and literary historians to use holdings of major digital libraries as proxies for the population of published works, sparing them the labor of collecting a representative sample. We address this question by taking advantage of exhaustive bibliographies of novels published for the first time in the British Isles in 1836 and 1838, identifying which of these novels have at least one digital surrogate in the Internet Archive, HathiTrust, Google Books, and the British Library. We find that digital surrogate availability is not random. Certain kinds of novels, notably novels written by men and novels published in multivolume format, have digital surrogates…
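The comparison described in the abstract (checking each bibliography entry for at least one surrogate across the four libraries, then comparing availability between groups) can be sketched roughly as follows. This is a minimal illustration, not the authors' method or code: the record layout, the field names (`author_gender`, `volumes`, and the per-library flags), and the toy records are all hypothetical and carry no real data from the study.

```python
from collections import defaultdict

# The four digital libraries named in the abstract; the keys are hypothetical field names.
LIBRARIES = ("internet_archive", "hathitrust", "google_books", "british_library")


def has_surrogate(record):
    """Return True if the novel has at least one digital surrogate in any library."""
    return any(record.get(lib, False) for lib in LIBRARIES)


def availability_rate(records, key):
    """Share of novels with at least one surrogate, grouped by the given field."""
    counts = defaultdict(lambda: [0, 0])  # group -> [with surrogate, total]
    for record in records:
        group = record[key]
        counts[group][1] += 1
        if has_surrogate(record):
            counts[group][0] += 1
    return {group: found / total for group, (found, total) in counts.items()}


# Invented toy records for illustration only; these are NOT results from the paper.
toy_records = [
    {"author_gender": "man", "volumes": 3, "internet_archive": True},
    {"author_gender": "man", "volumes": 1, "google_books": True, "hathitrust": True},
    {"author_gender": "woman", "volumes": 1},
    {"author_gender": "woman", "volumes": 3, "british_library": True},
]

rates_by_gender = availability_rate(toy_records, "author_gender")
rates_by_volumes = availability_rate(toy_records, "volumes")
```

Raw rates like these only describe the sample. Judging whether a difference between groups is larger than chance would suggest requires a statistical model, which this sketch does not attempt.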
