CTC Transcription Alignment of the Bullinger Letters: Automatic Improvement of Annotation Quality
Marco Peer, Anna Scius-Bertrand, Andreas Fischer

TL;DR
This paper presents a self-training CTC alignment method to improve handwritten text recognition accuracy and annotation quality in historical documents, specifically addressing hyphenation errors in 16th-century letters.
Contribution
We introduce a novel CTC alignment-based self-training approach that enhances annotation accuracy and demonstrates iterative improvements in handwritten text recognition for historical documents.
Findings
Improved CER by 1.1 percentage points with PyLaia
Weaker models produce more accurate alignments
Iterative training further enhances recognition and alignment quality
Abstract
Handwritten text recognition for historical documents remains challenging due to handwriting variability, degraded sources, and limited layout-aware annotations. In this work, we address annotation errors, particularly hyphenation issues, in the Bullinger correspondence, a large 16th-century letter collection. We introduce a self-training method based on a CTC alignment algorithm that matches full transcriptions to text line images using dynamic programming and the output probabilities of a model trained with the CTC loss. Our approach improves recognition performance (e.g., by 1.1 percentage points CER with PyLaia) and increases alignment accuracy. Interestingly, we find that weaker models yield more accurate alignments, enabling an iterative training strategy. We release a new manually corrected subset of 100 pages from the Bullinger dataset, along with our code and benchmarks. Our approach can be…
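The dynamic-programming step mentioned in the abstract can be illustrated with a minimal sketch of Viterbi-style CTC forced alignment. This is not the authors' implementation: it aligns a single known label sequence to the frame-wise log-probabilities of one text line, whereas the paper matches full transcriptions to text line images. The function name `ctc_forced_align`, the blank index `0`, and the toy probabilities are assumptions made for illustration.

```python
import numpy as np

def ctc_forced_align(log_probs, labels, blank=0):
    """Viterbi-align `labels` to frame-wise CTC log-probabilities.

    log_probs: (T, C) array of per-frame log-probabilities.
    labels:    list of class indices without blanks.
    Returns (score, path), where path[t] is the class assigned to frame t.
    A score of -inf means the labels do not fit into T frames.
    """
    T = log_probs.shape[0]
    # Extended sequence: blank, l1, blank, l2, ..., blank
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    S = len(ext)

    dp = np.full((T, S), -np.inf)       # best log-score ending in state s at frame t
    back = np.zeros((T, S), dtype=int)  # predecessor state for backtracking
    dp[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        dp[0, 1] = log_probs[0, ext[1]]

    for t in range(1, T):
        for s in range(S):
            # Allowed moves: stay, advance by one state, or skip a blank
            # between two different labels.
            cands = [s]
            if s >= 1:
                cands.append(s - 1)
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(s - 2)
            best = max(cands, key=lambda k: dp[t - 1, k])
            dp[t, s] = dp[t - 1, best] + log_probs[t, ext[s]]
            back[t, s] = best

    # A valid path ends in the final blank or the final label.
    ends = [S - 1] if S == 1 else [S - 1, S - 2]
    s = max(ends, key=lambda k: dp[T - 1, k])
    score = float(dp[T - 1, s])

    path = [blank] * T
    for t in range(T - 1, -1, -1):
        path[t] = ext[s]
        if t > 0:
            s = back[t, s]
    return score, path

# Toy example: 4 frames, 3 classes (0 = blank, 1 = "a", 2 = "b").
probs = np.array([
    [0.1, 0.8, 0.1],
    [0.8, 0.1, 0.1],
    [0.1, 0.1, 0.8],
    [0.1, 0.1, 0.8],
])
score, path = ctc_forced_align(np.log(probs), labels=[1, 2])
print(path)  # [1, 0, 2, 2]
```

The returned path gives, for each frame, the character (or blank) it is assigned to, so character boundaries can be read off where the path changes. The alignment score could in principle be used to compare candidate transcriptions for a line, though how the paper uses such scores is not stated in this summary.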
Taxonomy
Topics: Handwritten Text Recognition Techniques · Image Processing and 3D Reconstruction · Natural Language Processing Techniques
