Mixed Model OCR Training on Historical Latin Script for Out-of-the-Box Recognition and Finetuning
Christian Reul, Christoph Wick, Maximilian Nöth, Andreas Büttner, Maximilian Wehner, Uwe Springmann

TL;DR
This paper presents a robust polyfont OCR model for historical Latin scripts that achieves low error rates out-of-the-box and can be efficiently fine-tuned for specific printings, enhancing automatic recognition of diverse historical texts.
Contribution
The authors develop a widely applicable mixed-model OCR for Latin scripts that outperforms standard models and simplifies fine-tuning for specific historical printings.
Findings
The mixed model achieves a CER of 1.73% on unseen books, outperforming standard models.
Fine-tuning from the mixed model reduces CER by up to 50%.
The model is openly available for community use.
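For readers unfamiliar with the metric, the sketch below shows one common way to compute a Character Error Rate: the Levenshtein (edit) distance between OCR output and ground truth, divided by the number of ground-truth characters. It is an illustration only. The function names `levenshtein` and `cer` and the sample lines are hypothetical, and the authors' actual evaluation procedure is not described in this summary.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]


def cer(ground_truth_lines, ocr_lines) -> float:
    """Character Error Rate over paired text lines (assumed definition:
    total edit distance / total ground-truth characters)."""
    errors = sum(levenshtein(gt, ocr)
                 for gt, ocr in zip(ground_truth_lines, ocr_lines))
    chars = sum(len(gt) for gt in ground_truth_lines)
    return errors / chars if chars else 0.0


# Hypothetical sample lines, not taken from the paper's data.
ground_truth = ["Historia", "libri"]
ocr_output = ["Hiftoria", "libr"]
print(f"CER = {cer(ground_truth, ocr_output):.2%}")
```

Under this definition, a CER of 1.73% corresponds to roughly 17 character errors per 1,000 ground-truth characters.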
Abstract
In order to apply Optical Character Recognition (OCR) to historical printings of Latin script fully automatically, we report on our efforts to construct a widely-applicable polyfont recognition model yielding text with a Character Error Rate (CER) around 2% when applied out-of-the-box. Moreover, we show how this model can be further finetuned to specific classes of printings with little manual and computational effort. The mixed or polyfont model is trained on a wide variety of materials, in terms of age (from the 15th to the 19th century), typography (various types of Fraktur and Antiqua), and languages (among others, German, Latin, and French). To optimize the results we combined established techniques of OCR training like pretraining, data augmentation, and voting. In addition, we used various preprocessing methods to enrich the training data and obtain more robust models. We also…
