Reliable Editions from Unreliable Components: Estimating Ebooks from Print Editions Using Profile Hidden Markov Models
A. B. Riddell

TL;DR
This paper uses a profile hidden Markov model to automatically produce an accurately transcribed ebook, free of print artifacts, from multiple print editions, with particular benefits for readers and libraries that need books in an accessible format.
Contribution
It applies profile HMMs, a model popular in biological sequence analysis, to model and merge multiple print editions into a single high-quality ebook.
Findings
Successfully produced ebooks with accurate transcription
Eliminated print artifacts like hyphenation and headers
Demonstrated on seven copies of a nineteenth-century novel
Abstract
A profile hidden Markov model, a popular model in biological sequence analysis, can be used to model related sequences of characters transcribed from books, magazines, and other printed materials. This paper documents one application of a profile HMM: automatically producing an ebook edition from distinct print editions. The resulting ebook has virtually all the desired properties found in a publisher-prepared ebook, including accurate transcription and an absence of print artifacts such as end-of-line hyphenation and running headers. The technique, which has particular benefits for readers and libraries that require books in an accessible format, is demonstrated using seven copies of a nineteenth-century novel.
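To make the idea of merging several imperfect transcriptions concrete, here is a minimal sketch in Python. It is not the paper's implementation. It assumes the copies have already been aligned character by character (a full profile HMM also models insert and delete transitions and can be estimated from unaligned sequences), and it uses `~` as a hypothetical gap symbol. Columns where most copies have a gap are treated as insert columns and dropped, a common heuristic when building a profile HMM from an alignment; the remaining match columns get smoothed emission probabilities, and the consensus takes the most probable character in each. All function and parameter names are illustrative.

```python
from collections import Counter

GAP = "~"  # assumed gap symbol in the pre-aligned input (hypothetical choice)


def match_columns(aligned, max_gap_fraction=0.5):
    """Return indices of columns treated as match states.

    A column counts as a match state when at most `max_gap_fraction`
    of the copies have a gap there; other columns are insert columns.
    """
    n = len(aligned)
    width = len(aligned[0])
    cols = []
    for j in range(width):
        gaps = sum(1 for row in aligned if row[j] == GAP)
        if gaps / n <= max_gap_fraction:
            cols.append(j)
    return cols


def emission_probs(aligned, j, alphabet, pseudocount=1.0):
    """Estimate emission probabilities for column j with add-k smoothing."""
    counts = Counter(row[j] for row in aligned if row[j] != GAP)
    total = sum(counts.values()) + pseudocount * len(alphabet)
    return {c: (counts[c] + pseudocount) / total for c in alphabet}


def consensus(aligned):
    """Most probable character at each match column, joined into a string."""
    alphabet = sorted({c for row in aligned for c in row if c != GAP})
    out = []
    for j in match_columns(aligned):
        probs = emission_probs(aligned, j, alphabet)
        out.append(max(probs, key=probs.get))
    return "".join(out)


# Three hypothetical copies of the same phrase: one carries an
# end-of-line hyphen, another has a transcription error (h read as b).
copies = [
    "the rea-der",
    "the rea~der",
    "tbe rea~der",
]
merged = consensus(copies)
```

In this toy input the hyphen appears in only one of three copies, so its column is dropped as an insert column, and the misread `b` is outvoted by the two copies that agree on `h`.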
Taxonomy
Topics: Digital Humanities and Scholarship · Authorship Attribution and Profiling · Computational and Text Analysis Methods
