Fast Search with Poor OCR
Taivanbat Badamdorj, Adiel Ben-Shalom, Nachum Dershowitz, Lior Wolf

TL;DR
This paper introduces a novel vector-based search method for historical documents with poor OCR quality, enabling efficient retrieval despite noisy text, demonstrated on WWII German documents.
Contribution
The paper presents a new text search approach that uses vector representations and nearest-neighbor search to handle noisy OCR outputs effectively.
Findings
Effective retrieval on WWII German documents
Robustness to OCR noise demonstrated
Practical search system for historical texts
Abstract
The indexing and searching of historical documents have garnered attention in recent years due to massive digitization efforts of important collections worldwide. Pure textual search in these corpora is a problem since optical character recognition (OCR) is infamous for performing poorly on such historical material, which often suffers from poor preservation. We propose a novel text-based method for searching through noisy text. Our system represents words as vectors, projects queries and candidates obtained from the OCR into a common space, and ranks the candidates using a metric suited to nearest-neighbor search. We demonstrate the practicality of our method on typewritten German documents from the WWII era.
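The pipeline described in the abstract (words as vectors, a common space for queries and OCR candidates, ranking by a nearest-neighbor metric) can be illustrated with a small sketch. The abstract does not say how the vectors are built or which metric is used, so everything below is an assumption made for illustration: character-bigram count vectors stand in for the word representation, cosine similarity stands in for the ranking metric, and the function names `char_ngrams`, `cosine`, and `rank_candidates` are hypothetical. It is not the authors' implementation.

```python
import math
from collections import Counter


def char_ngrams(word, n=2):
    """Map a word to a sparse vector of character n-gram counts.

    Assumption: n-gram counts are a stand-in for the paper's word vectors.
    The '#' padding marks word boundaries so prefixes and suffixes count.
    """
    padded = f"#{word.lower()}#"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(query, candidates, n=2):
    """Rank OCR candidate words by similarity to the query, best first."""
    query_vec = char_ngrams(query, n)
    scored = [(cosine(query_vec, char_ngrams(c, n)), c) for c in candidates]
    return sorted(scored, key=lambda pair: -pair[0])


if __name__ == "__main__":
    # Hypothetical OCR output in which "Berlin" was misread as "Ber1in".
    ocr_words = ["Munchen", "Ber1in", "Wien"]
    for score, word in rank_candidates("Berlin", ocr_words):
        print(f"{score:.3f}  {word}")
```

In this toy example the misread token "Ber1in" ranks first for the query "Berlin" because it shares most of its bigrams with the query, whereas an exact string match would miss it. A brute-force scan like this is only practical for small vocabularies; a large collection would need an index built for nearest-neighbor search.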
Taxonomy
Topics: Handwritten Text Recognition Techniques · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques
