Profiling of OCR'ed Historical Texts Revisited

Florian Fink; Klaus-U. Schulz; Uwe Springmann

arXiv:1701.05377 · cs.CV · January 20, 2017

TL;DR

This paper extends a statistical profiling method for OCR'ed historical texts in three ways: it makes the method adaptive to user feedback, adds new historical patterns, and uses uninterpretable tokens to improve error detection and postcorrection accuracy.

Contribution

It introduces an adaptive, feedback-aware extension of the existing profiling method, improving error recognition and error class differentiation in OCR'ed historical texts.

Findings

1. Adaptive profiling improves OCR error recognition.
2. Adding historical patterns enhances error discrimination.
3. Utilizing uninterpretable tokens increases error detection recall.

Abstract

In the absence of ground truth it is not possible to automatically determine the exact spectrum and occurrences of OCR errors in an OCR'ed text. Yet, for interactive postcorrection of OCR'ed historical printings it is extremely useful to have a statistical profile available that provides an estimate of error classes with associated frequencies, and that points to conjectured errors and suspicious tokens. The method introduced in Reffle (2013) computes such a profile, combining lexica, pattern sets and advanced matching techniques in a specialized Expectation Maximization (EM) procedure. Here we improve this method in three respects: First, the method in Reffle (2013) is not adaptive: user feedback obtained by actual postcorrection steps cannot be used to compute refined profiles. We introduce a variant of the method that is open for adaptivity, taking correction steps of the user into…
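
To make the profiling idea in the abstract concrete, here is a minimal toy sketch of an EM-style estimate of pattern frequencies. It is not the procedure of Reffle (2013) or of this paper: the function name `em_profile`, the pattern labels, the example tokens, and the simple per-token normalization are all assumptions chosen for illustration. Each token comes with a list of candidate interpretations, and each candidate is a tuple of patterns (historical spelling patterns or OCR error patterns) needed to explain the token.

```python
from collections import defaultdict
from math import prod


def em_profile(tokens, iterations=20, init=0.1, floor=1e-6):
    """Toy EM over pattern probabilities (illustrative assumption only).

    tokens: list of candidate lists; each candidate is a tuple of
            pattern labels that would explain the observed token.
    Returns (pattern probabilities, per-token candidate posteriors).
    """
    patterns = {p for cands in tokens for cand in cands for p in cand}
    prob = {p: init for p in patterns}
    posteriors = []
    for _ in range(iterations):
        expected = defaultdict(float)
        posteriors = []
        for cands in tokens:
            if not cands:
                # Uninterpretable token: this sketch simply skips it.
                posteriors.append([])
                continue
            # E-step: score each candidate by the product of its
            # pattern probabilities, then normalize per token.
            scores = [prod(prob[p] for p in cand) for cand in cands]
            total = sum(scores)
            weights = [s / total for s in scores]
            posteriors.append(weights)
            for cand, w in zip(cands, weights):
                for p in cand:
                    expected[p] += w
        # M-step: expected pattern count per token, floored above zero.
        prob = {p: max(expected[p] / len(tokens), floor) for p in patterns}
    return prob, posteriors


# Hypothetical data: three tokens explained only by a historical pattern,
# one explained only by an OCR error, and one ambiguous token.
HIST = "hist:t->th"
OCR = "ocr:n->u"
tokens = [[(HIST,)], [(HIST,)], [(HIST,)], [(OCR,)], [(HIST,), (OCR,)]]

probs, posteriors = em_profile(tokens)
print(probs)
print(posteriors[-1])
```

In this toy setup the ambiguous token is pulled toward the historical reading because that pattern is better supported by the unambiguous tokens. An adaptive variant in the spirit of the abstract could fix the posterior of a token to the candidate a user confirmed during postcorrection, but that step is not shown here.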

Taxonomy

Topics: Algorithms and Data Compression · Handwritten Text Recognition Techniques · Natural Language Processing Techniques
