Probabilistic Management of OCR Data using an RDBMS
Arun Kumar, Christopher Ré

TL;DR
This paper introduces a probabilistic approach to managing OCR data within a relational database: it retains the uncertainty in the OCR output to improve query accuracy, and it proposes an approximation scheme called Staccato that trades off query performance against recall.
Contribution
The paper presents a method for storing probabilistic OCR output in an RDBMS, together with an approximation scheme (Staccato) that trades recall against query performance, a formal analysis of that scheme, and its integration with standard RDBMS indexing.
Findings
Staccato significantly improves query performance while maintaining high recall.
Retaining the probabilistic OCR output improves retrieval accuracy over storing only the ASCII text.
The scheme is formally analyzed and effectively integrated with standard RDBMS indexing.
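The recall gap between ASCII-only storage and retained uncertainty can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the per-position character alternatives, their probabilities, the independence between positions, and the helper names `map_string`, `match_probability`, and `top_k_strings` are hypothetical, and the top-k truncation is a simple stand-in rather than the paper's Staccato scheme.

```python
from itertools import product

# Hypothetical OCR output for one word: a list of (character, probability)
# alternatives per position. Positions are assumed independent (a simplification).
ocr_positions = [
    [("F", 0.6), ("T", 0.4)],
    [("0", 0.55), ("o", 0.45)],   # the OCR slightly prefers the digit zero
    [("r", 0.8), ("n", 0.2)],
    [("d", 0.9), ("a", 0.1)],
]

def all_strings(positions):
    """Enumerate every candidate string with its probability (exponential; toy only)."""
    for combo in product(*positions):
        text = "".join(ch for ch, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        yield text, prob

def map_string(positions):
    """The single most likely string, i.e. what ASCII-only storage would keep."""
    return "".join(max(alts, key=lambda a: a[1])[0] for alts in positions)

def match_probability(positions, pattern):
    """Total probability of the candidate strings that contain the pattern."""
    return sum(p for text, p in all_strings(positions) if pattern in text)

def top_k_strings(positions, k):
    """Keep only the k most likely strings: a crude storage/recall trade-off."""
    ranked = sorted(all_strings(positions), key=lambda s: s[1], reverse=True)
    return ranked[:k]

if __name__ == "__main__":
    print(map_string(ocr_positions))                  # most likely string only
    print(match_probability(ocr_positions, "Ford"))   # nonzero despite the MAP miss
    print([s for s, _ in top_k_strings(ocr_positions, 2)])
```

In this toy setup a query for "Ford" finds nothing in the single most likely string, yet the full set of alternatives still assigns the match a nonzero probability. Keeping the two most likely strings instead of one already recovers the match, at the cost of storing and scanning more data.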
Abstract
The digitization of scanned forms and documents is changing the data sources that enterprises manage. To integrate these new data sources with enterprise data, the current state-of-the-art approach is to convert the images to ASCII text using optical character recognition (OCR) software and then to store the resulting ASCII text in a relational database. The OCR problem is challenging, and so the output of OCR often contains errors. In turn, queries on the output of OCR may fail to retrieve relevant answers. State-of-the-art OCR programs, e.g., the OCR powering Google Books, use a probabilistic model that captures many alternatives during the OCR process. Only when the results of OCR are stored in the database do these approaches discard the uncertainty. In this work, we propose to retain the probabilistic models produced by the OCR process in a relational database management system. A key…
Taxonomy
Topics: Advanced Database Systems and Queries · Data Management and Algorithms · Algorithms and Data Compression
