Mind the Gap: Analyzing Lacunae with Transformer-Based Transcription
Jaydeep Borkar, David A. Smith

TL;DR
This paper presents a transformer-based OCR approach trained on synthetic data to detect and restore lacunae in historical documents. It restores 65% of lacunae, compared with 5% for a base model that has no knowledge of lacunae, and it uses transcription log probabilities to flag errors without inspecting the image.
Contribution
The study introduces a supervised transformer OCR model trained on synthetic lacunae data, demonstrating improved restoration and error detection capabilities in damaged historical documents.
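The summary does not say how the synthetic lacunae are produced. As a purely illustrative sketch, one simple way to simulate a gap is to blank out a random span of columns in a line image and record where the gap is, so the pair can serve as a supervised training example. The function name `add_synthetic_lacuna`, the nested-list grayscale representation, and the fill value of 255 (white) are all assumptions made for this example, not details from the paper.

```python
import random


def add_synthetic_lacuna(image, width, fill=255, rng=None):
    """Blank out a random column span in a grayscale line image.

    `image` is a list of rows, each a list of pixel values (hypothetical
    representation). Returns a damaged copy and the (start, end) column
    span of the gap, with `end` exclusive. The input is not modified.
    """
    rng = rng or random.Random()
    n_cols = len(image[0])
    width = min(width, n_cols)  # the gap cannot be wider than the image
    start = rng.randint(0, n_cols - width)  # randint is inclusive on both ends
    damaged = [
        row[:start] + [fill] * width + row[start + width:]
        for row in image
    ]
    return damaged, (start, start + width)


# A tiny all-black 4x10 "line image" with a 3-column gap punched into it.
line_image = [[0] * 10 for _ in range(4)]
damaged_image, gap = add_synthetic_lacuna(line_image, width=3, rng=random.Random(0))
```

Returning the gap span alongside the damaged image keeps the ground truth available, which is what a supervised setup needs in order to mark the lacuna in the target transcription.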
Findings
The lacunae-aware model restored lacunae with a 65% success rate
A base model with no knowledge of lacunae restored only 5%
The log probability of a transcription can flag lacunae and other transcription errors without inspecting the line image
Abstract
Historical documents frequently suffer from damage and inconsistencies, including missing or illegible text resulting from issues such as holes, ink problems, and storage damage. These missing portions or gaps are referred to as lacunae. In this study, we employ transformer-based optical character recognition (OCR) models trained on synthetic data containing lacunae in a supervised manner. We demonstrate their effectiveness in detecting and restoring lacunae, achieving a success rate of 65%, compared to a base model lacking knowledge of lacunae, which achieves only 5% restoration. Additionally, we investigate the mechanistic properties of the model, such as the log probability of transcription, which can identify lacunae and other errors (e.g., mistranscriptions due to complex writing or ink issues) in line images without directly inspecting the image. This capability could be valuable…
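The idea of using transcription log probability to find problem lines can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes the OCR model exposes one log probability per output token for each line, averages them, and flags lines whose average falls below a cutoff. The function name `flag_suspect_lines`, the threshold of -1.0, and the sample values are hypothetical.

```python
from statistics import mean


def flag_suspect_lines(lines, threshold=-1.0):
    """Return (line_id, mean_log_prob) pairs for low-confidence lines.

    `lines` maps a line identifier to the per-token log probabilities of
    its transcription (assumed to be available from the OCR model).
    Lines whose mean log probability is below `threshold` are returned,
    least confident first. Lines with no tokens are skipped.
    """
    flagged = []
    for line_id, token_log_probs in lines.items():
        if not token_log_probs:
            continue
        score = mean(token_log_probs)
        if score < threshold:
            flagged.append((line_id, score))
    return sorted(flagged, key=lambda item: item[1])


# Made-up scores: line_002 contains two very uncertain tokens, as might
# happen where the model had to guess across a gap or a smudge.
example = {
    "line_001": [-0.05, -0.10, -0.02, -0.08],
    "line_002": [-0.04, -2.90, -3.40, -0.06],
    "line_003": [-0.20, -0.15, -0.30, -0.25],
}
suspects = flag_suspect_lines(example, threshold=-1.0)
```

Averaging over tokens keeps long and short lines comparable. A real threshold would have to be chosen on held-out data, and a per-token view may localize a gap more precisely than a per-line mean.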
Taxonomy
Topics: Machine Learning in Bioinformatics
Methods: Softmax · Attention Is All You Need · Balanced Selection
