Bounding the Probability of Error for High Precision Recognition
Andrew Kae, Gary B. Huang, Erik Learned-Miller

TL;DR
This paper introduces a novel method for OCR systems to identify a small subset of words with extremely high confidence and near-zero error, enabling more accurate document processing.
Contribution
It presents a new technique for bounding the probability of error in OCR, allowing high-precision word selection without relying heavily on posterior probability estimates.
Findings
Identified a subset of about 6% of the words in noisy documents with zero errors
Bounded error probabilities under general assumptions
Enabled high-precision word list creation for improved OCR performance
Abstract
We consider models for which it is important, early in processing, to estimate some variables with high precision, but perhaps at relatively low rates of recall. If some variables can be identified with near certainty, then they can be conditioned upon, allowing further inference to be done efficiently. Specifically, we consider optical character recognition (OCR) systems that can be bootstrapped by identifying a subset of correctly translated document words with very high precision. This "clean set" is subsequently used as document-specific training data. While many current OCR systems produce measures of confidence for the identity of each letter or word, thresholding these confidence values, even at very high values, still produces some errors. We introduce a novel technique for identifying a set of correct words with very high precision. Rather than estimating posterior…
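The abstract's point that thresholding confidence values still lets some errors through can be illustrated with a minimal sketch. This shows the baseline the abstract argues against, not the authors' bounding technique. The `WordGuess` type, the confidence scores, and the toy word list are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class WordGuess:
    text: str          # OCR output for one word (toy data)
    confidence: float  # hypothetical confidence score in [0, 1]
    correct: bool      # ground truth, available only when evaluating


def select_clean_set(words, threshold):
    """Keep only the words whose confidence meets the threshold."""
    return [w for w in words if w.confidence >= threshold]


def precision_recall(words, selected):
    """Precision and recall of the selected subset against ground truth."""
    total_correct = sum(w.correct for w in words)
    selected_correct = sum(w.correct for w in selected)
    precision = selected_correct / len(selected) if selected else 1.0
    recall = selected_correct / total_correct if total_correct else 0.0
    return precision, recall


# Toy, invented example: one word is confidently wrong.
words = [
    WordGuess("the", 0.99, True),
    WordGuess("modem", 0.97, False),       # e.g. "modern" misread
    WordGuess("precision", 0.95, True),
    WordGuess("error", 0.80, True),
    WordGuess("recogniton", 0.60, False),
]

clean = select_clean_set(words, threshold=0.9)
precision, recall = precision_recall(words, clean)
print([w.text for w in clean], precision, recall)
```

In this toy data the confidently wrong word survives a 0.9 threshold, so the selected set is not error-free. Raising the threshold further trades recall for precision but gives no guarantee, which is the gap an explicit bound on the probability of error is meant to address.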
Taxonomy
Topics: Handwritten Text Recognition Techniques · Algorithms and Data Compression · Machine Learning and Algorithms
