Bounding the Probability of Error for High Precision Recognition

Andrew Kae; Gary B. Huang; Erik Learned-Miller

arXiv:0907.0418 · cs.CV · July 3, 2009 · 2 cites

TL;DR

This paper introduces a method for OCR systems to identify a small subset of words with very high precision and near-zero error. This "clean set" then serves as document-specific training data for further recognition.

Contribution

It presents a new technique for bounding the probability of error in OCR, allowing high-precision word selection without relying heavily on posterior probability estimates.

Findings

01. Identified about 6% of words with zero errors in noisy documents

02. Bounded error probabilities under general assumptions

03. Enabled high-precision word list creation for improved OCR performance

Abstract

We consider models for which it is important, early in processing, to estimate some variables with high precision, but perhaps at relatively low rates of recall. If some variables can be identified with near certainty, then they can be conditioned upon, allowing further inference to be done efficiently. Specifically, we consider optical character recognition (OCR) systems that can be bootstrapped by identifying a subset of correctly translated document words with very high precision. This "clean set" is subsequently used as document-specific training data. While many current OCR systems produce measures of confidence for the identity of each letter or word, thresholding these confidence values, even at very high values, still produces some errors. We introduce a novel technique for identifying a set of correct words with very high precision. Rather than estimating posterior…
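
The baseline the abstract contrasts against, thresholding per-word confidence values, can be sketched as follows. This is only an illustration of that baseline and of the precision/recall trade-off, not the paper's bounding technique; the function name `precision_recall_at` and the toy confidence values are hypothetical.

```python
from typing import List, Optional, Tuple

# Each word is (confidence reported by the OCR system, whether the
# translation is actually correct). The values below are made up.
Word = Tuple[float, bool]


def precision_recall_at(words: List[Word], threshold: float) -> Tuple[Optional[float], float]:
    """Select words with confidence >= threshold as the candidate clean set.

    Returns (precision, recall). Precision is None when nothing is selected.
    """
    selected = [ok for conf, ok in words if conf >= threshold]
    total_correct = sum(1 for _, ok in words if ok)
    hits = sum(1 for ok in selected if ok)
    precision = hits / len(selected) if selected else None
    recall = hits / total_correct if total_correct else 0.0
    return precision, recall


toy_words: List[Word] = [
    (0.99, True), (0.98, True), (0.97, False), (0.95, True),
    (0.90, True), (0.80, False), (0.60, True), (0.40, False),
]

# A high threshold still admits one wrong word in this toy data.
loose = precision_recall_at(toy_words, 0.96)
# Raising it further removes the error but keeps fewer words.
strict = precision_recall_at(toy_words, 0.985)
```

In this toy data a confidently wrong word (confidence 0.97) survives a 0.96 threshold, which mirrors the abstract's point that even very high thresholds can still produce some errors.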

Taxonomy

Topics: Handwritten Text Recognition Techniques · Algorithms and Data Compression · Machine Learning and Algorithms