Profiling of OCR'ed Historical Texts Revisited
Florian Fink, Klaus-U. Schulz, Uwe Springmann

TL;DR
This paper enhances a statistical profiling method for OCR'ed historical texts by making it adaptive, incorporating new historical patterns, and utilizing uninterpretable tokens to improve error detection and postcorrection accuracy.
Contribution
It introduces an adaptive, feedback-aware extension of the existing profiling method, improving error recognition and error class differentiation in OCR'ed historical texts.
Findings
Adaptive profiling improves OCR error recognition.
Adding historical patterns enhances error discrimination.
Utilizing uninterpretable tokens increases error detection recall.
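As a rough illustration of what a statistical error profile can look like, the following Python sketch estimates relative frequencies for a small set of OCR error patterns with a plain EM-style loop and lists the tokens that have no interpretation at all. It is a toy under stated assumptions, not the procedure of Reffle (2013) or of this paper: it has no historical patterns, no approximate matching, and no user feedback, and the names `candidates` and `profile`, the sample lexicon, and the two patterns are all hypothetical.

```python
from collections import defaultdict


def candidates(token, lexicon, patterns):
    """Return (word, pattern) readings of a token.

    pattern is None when the token is itself a lexicon word; otherwise it is
    the (correct, ocr) pair whose single application yields a lexicon word.
    """
    found = []
    if token in lexicon:
        found.append((token, None))
    for good, bad in patterns:
        start = token.find(bad)
        while start != -1:
            word = token[:start] + good + token[start + len(bad):]
            if word in lexicon:
                found.append((word, (good, bad)))
            start = token.find(bad, start + 1)
    return found


def profile(tokens, lexicon, patterns, rounds=10):
    """Estimate relative pattern frequencies (toy EM) and collect
    uninterpretable tokens. Hypothetical helper, for illustration only."""
    keys = [None] + list(patterns)          # None stands for "no OCR error"
    weight = {k: 1.0 / len(keys) for k in keys}
    readings = {t: candidates(t, lexicon, patterns) for t in set(tokens)}
    for _ in range(rounds):
        expected = defaultdict(float)
        # E-step: split each token's unit mass over its readings.
        for t in tokens:
            total = sum(weight[p] for _, p in readings[t])
            if total == 0:
                continue                    # token has no reading
            for _, p in readings[t]:
                expected[p] += weight[p] / total
        # M-step: renormalise expected counts into new weights.
        norm = sum(expected.values())
        if norm == 0:
            break
        weight = {k: expected[k] / norm for k in keys}
    uninterpretable = sorted(t for t, c in readings.items() if not c)
    return weight, uninterpretable


# Assumed sample data: two lexicon words, two OCR confusion patterns.
lexicon = {"und", "kommt"}
patterns = [("u", "n"), ("m", "rn")]
tokens = ["und", "nnd", "und", "kornmt", "xqz"]
weights, unknown = profile(tokens, lexicon, patterns)
```

In this sketch the uninterpretable tokens are only reported; the paper's point is that such tokens can additionally be exploited to raise error detection recall.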
Abstract
In the absence of ground truth it is not possible to automatically determine the exact spectrum and occurrences of OCR errors in an OCR'ed text. Yet, for interactive postcorrection of OCR'ed historical printings it is extremely useful to have a statistical profile available that provides an estimate of error classes with associated frequencies, and that points to conjectured errors and suspicious tokens. The method introduced in Reffle (2013) computes such a profile, combining lexica, pattern sets and advanced matching techniques in a specialized Expectation Maximization (EM) procedure. Here we improve this method in three respects: First, the method in Reffle (2013) is not adaptive: user feedback obtained by actual postcorrection steps cannot be used to compute refined profiles. We introduce a variant of the method that is open for adaptivity, taking correction steps of the user into…
Taxonomy
Topics: Algorithms and Data Compression · Handwritten Text Recognition Techniques · Natural Language Processing Techniques
See pages 1-last of 2017-01-19-Datech2017.pdf
