Case Study of a highly automated Layout Analysis and OCR of an incunabulum: 'Der Heiligen Leben' (1488)
Christian Reul, Marco Dittrich, and Martin Gruner

TL;DR
This study documents a highly automated OCR workflow for an incunabulum, reaching 97.57% character accuracy while cutting the human effort for segmentation from over 100 hours to less than six, with implications for digitizing early printed books.
Contribution
It documents a complete OCR workflow for an incunabulum (preprocessing, layout analysis, and text recognition), compares a highly automated (LAREX) and a manual (Aletheia) layout analysis method, and records the human effort required for each step.
Findings
Character recognition accuracy of 97.57%
Word recognition accuracy of 92.19%
Human effort reduced from over 100 hours to less than six hours
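The character- and word-level figures above can be illustrated with a minimal sketch. It assumes the common definition of accuracy as one minus the edit distance divided by the reference length; this page does not state the paper's exact evaluation protocol, so the definition and the function names (edit_distance, accuracy, char_accuracy, word_accuracy) are illustrative assumptions, not the authors' tooling.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference items yields the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def accuracy(ref, hyp):
    """Assumed definition: 1 - edit_distance / reference length."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return 1.0 - edit_distance(ref, hyp) / len(ref)


def char_accuracy(ref_text, ocr_text):
    """Accuracy computed over individual characters."""
    return accuracy(ref_text, ocr_text)


def word_accuracy(ref_text, ocr_text):
    """Accuracy computed over whitespace-separated tokens."""
    return accuracy(ref_text.split(), ocr_text.split())


if __name__ == "__main__":
    ground_truth = "der heiligen leben"  # made-up example line, not from the paper
    ocr_output = "der heiligcn leben"    # one substituted character
    print(f"character accuracy: {char_accuracy(ground_truth, ocr_output):.4f}")
    print(f"word accuracy:      {word_accuracy(ground_truth, ocr_output):.4f}")
```

Under this definition a single wrong character invalidates the whole word that contains it, which is why word accuracy is typically lower than character accuracy on the same text.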
Abstract
This paper provides the first thorough documentation of a high-quality digitization process applied to an early printed book from the incunabulum period (1450-1500). The entire OCR-related workflow, including preprocessing, layout analysis, and text recognition, is illustrated in detail using the example of 'Der Heiligen Leben', printed in Nuremberg in 1488. For each step, the required time expenditure was recorded. The character recognition yielded excellent results at both the character (97.57%) and word (92.19%) level. Furthermore, a highly automated (LAREX) and a manual (Aletheia) method for layout analysis were compared. Considerably automating the segmentation reduced the required human effort from over 100 hours to less than six hours, at the cost of only a slight drop in OCR accuracy. Realistic estimates for the human effort necessary for full text…
Taxonomy
Topics: Handwritten Text Recognition Techniques · Image Retrieval and Classification Techniques · Image Processing and 3D Reconstruction
