Mixed Model OCR Training on Historical Latin Script for Out-of-the-Box Recognition and Finetuning

Christian Reul; Christoph Wick; Maximilian Nöth; Andreas Büttner; Maximilian Wehner; Uwe Springmann

arXiv:2106.07881·cs.CV·June 16, 2021

TL;DR

This paper presents a robust polyfont OCR model for historical Latin scripts that achieves low error rates out-of-the-box and can be efficiently fine-tuned for specific printings, enhancing automatic recognition of diverse historical texts.

Contribution

The authors develop a widely applicable mixed-model OCR for Latin scripts that outperforms standard models and simplifies fine-tuning for specific historical printings.

Findings

1. The mixed model achieves a CER of 1.73% on unseen books, outperforming standard models.

2. Fine-tuning from the mixed model reduces CER by up to 50%.

3. The model is openly available for community use.

Abstract

In order to apply Optical Character Recognition (OCR) to historical printings of Latin script fully automatically, we report on our efforts to construct a widely-applicable polyfont recognition model yielding text with a Character Error Rate (CER) around 2% when applied out-of-the-box. Moreover, we show how this model can be further finetuned to specific classes of printings with little manual and computational effort. The mixed or polyfont model is trained on a wide variety of materials, in terms of age (from the 15th to the 19th century), typography (various types of Fraktur and Antiqua), and languages (among others, German, Latin, and French). To optimize the results we combined established techniques of OCR training like pretraining, data augmentation, and voting. In addition, we used various preprocessing methods to enrich the training data and obtain more robust models. We also…
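The Character Error Rate quoted above is commonly computed as the edit distance between the OCR output and the ground-truth transcription, divided by the length of the ground truth. The following is a minimal sketch under that assumption; the function names `levenshtein` and `cer` are illustrative and not taken from the paper, and the authors' evaluation may differ in details such as text normalization.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b; start with the empty prefix of a.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def cer(prediction: str, ground_truth: str) -> float:
    """Character Error Rate: edit distance normalized by the
    ground-truth length (assumed definition, see lead-in)."""
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    return levenshtein(prediction, ground_truth) / len(ground_truth)
```

Under this definition, a CER of 2% corresponds to roughly two character edits per 100 ground-truth characters.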
