Automatic quality evaluation and (semi-) automatic improvement of OCR models for historical printings

U. Springmann; F. Fink; K. U. Schulz

arXiv:1606.05157·cs.DL·October 21, 2016·19 cites

TL;DR

This paper proposes a strategy for improving OCR accuracy on historical printings by using mixed models trained on multiple fonts, combined with automatic and semi-automatic refinement methods that require minimal manual transcriptions.

Contribution

It introduces an approach to generalizing OCR models across similar fonts without extensive ground truth, and presents methods for evaluating recognition quality when no ground truth is available.

Findings

01. Mixed models achieve over 90% character accuracy on unseen fonts.

02. Semi-automatic correction with 100-200 lines significantly reduces error rates.

03. New quality metrics correlate well with true OCR accuracy.

Abstract

Good OCR results for historical printings rely on the availability of recognition models trained on diplomatic transcriptions as ground truth, which is both a scarce resource and time-consuming to generate. Instead of having to train a separate model for each historical typeface, we propose a strategy to start from models trained on a combined set of available transcriptions in a variety of fonts. These "mixed models" result in character accuracy rates over 90% on a test set of printings from the same period of time, but without any representation in the training data, demonstrating the possibility to overcome the typography barrier by generalizing from a few typefaces to a larger set of (similar) fonts in use over a period of time. The output of these mixed models is then used as a baseline to be further improved by both fully automatic methods and semi-automatic methods…
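
For readers unfamiliar with the metric, character accuracy is commonly defined as one minus the edit distance between OCR output and ground truth, divided by the ground-truth length. The sketch below illustrates that common definition; it is an assumption that this matches the paper's evaluation, since the abstract does not name the tool or the exact formula. The function names `levenshtein` and `character_accuracy` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting i characters of a reaches the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Assumed definition: 1 - (edit distance / ground-truth length)."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return 1.0 - levenshtein(ground_truth, ocr_output) / len(ground_truth)


if __name__ == "__main__":
    # Illustrative line only: the long s is misread as f (one substitution).
    print(character_accuracy("vnd das ſie", "vnd das fie"))
```

Under this definition, a rate over 90% means fewer than one edit per ten ground-truth characters. Note that the value can turn negative when the OCR output is much longer than the ground truth.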

Taxonomy

Topics: Handwritten Text Recognition Techniques · Mathematics, Computing, and Information Processing · Natural Language Processing Techniques