Automatic quality evaluation and (semi-) automatic improvement of OCR models for historical printings
U. Springmann, F. Fink, K. U. Schulz

TL;DR
This paper proposes a strategy for improving OCR accuracy on historical printings by using mixed models trained on multiple fonts, combined with automatic and semi-automatic refinement methods that require minimal manual transcriptions.
Contribution
It introduces a novel approach to generalize OCR models across similar fonts without extensive ground truth, and presents methods to evaluate recognition quality without ground truth.
Findings
Mixed models achieve over 90% character accuracy on unseen fonts.
Semi-automatic correction with 100-200 lines significantly reduces error rates.
New quality metrics correlate well with true OCR accuracy.
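The character accuracy figures above can be made concrete with a minimal sketch. It assumes the common definition of character accuracy as one minus the character-level edit distance divided by the ground-truth length; the summary does not say which definition the authors use. The function names `levenshtein` and `character_accuracy` are hypothetical, chosen for illustration only.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_accuracy(ground_truth: Sequence[str], ocr_output: Sequence[str]) -> float:
    """Assumed definition: 1 - (total edit distance / total ground-truth characters),
    computed over aligned pairs of text lines."""
    if len(ground_truth) != len(ocr_output):
        raise ValueError("ground truth and OCR output must have the same number of lines")
    total_chars = sum(len(line) for line in ground_truth)
    if total_chars == 0:
        raise ValueError("ground truth contains no characters")
    total_errors = sum(levenshtein(gt, ocr) for gt, ocr in zip(ground_truth, ocr_output))
    return 1.0 - total_errors / total_chars


# Illustrative line only (not taken from the paper): a long s misread as f.
accuracy = character_accuracy(["Vnd ſprach"], ["Vnd fprach"])
```

In this made-up example one of ten ground-truth characters is wrong, so the accuracy is 0.9. Measuring it this way requires a diplomatic transcription of each test line, which is the scarce resource the quality metrics mentioned above are meant to avoid.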
Abstract
Good OCR results for historical printings rely on the availability of recognition models trained on diplomatic transcriptions as ground truth, which is both a scarce resource and time-consuming to generate. Instead of having to train a separate model for each historical typeface, we propose a strategy to start from models trained on a combined set of available transcriptions in a variety of fonts. These mixed models result in character accuracy rates over 90% on a test set of printings from the same period of time, but without any representation in the training data, demonstrating the possibility to overcome the typography barrier by generalizing from a few typefaces to a larger set of (similar) fonts in use over a period of time. The output of these mixed models is then used as a baseline to be further improved by both fully automatic methods and semi-automatic methods…
Taxonomy
Topics: Handwritten Text Recognition Techniques · Mathematics, Computing, and Information Processing · Natural Language Processing Techniques
