Fast Search with Poor OCR

Taivanbat Badamdorj; Adiel Ben-Shalom; Nachum Dershowitz; Lior Wolf

arXiv:1909.07899 · cs.IR · April 23, 2020 · 1 citation

TL;DR

This paper introduces a vector-based search method for historical documents with poor OCR quality. It enables efficient retrieval despite noisy text and is demonstrated on typewritten German documents from the WWII era.

Contribution

The paper presents a new text search approach that uses vector representations and nearest-neighbor search to handle noisy OCR outputs effectively.

Findings

01

Effective retrieval on WWII German documents

02

Robustness to OCR noise demonstrated

03

Practical search system for historical texts

Abstract

The indexing and searching of historical documents have garnered attention in recent years due to massive digitization efforts of important collections worldwide. Pure textual search in these corpora is a problem since optical character recognition (OCR) is infamous for performing poorly on such historical material, which often suffers from poor preservation. We propose a novel text-based method for searching through noisy text. Our system represents words as vectors, projects queries and candidates obtained from the OCR into a common space, and ranks the candidates using a metric suited to nearest-neighbor search. We demonstrate the practicality of our method on typewritten German documents from the WWII era.
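
The pipeline described in the abstract (words as vectors, a common space for queries and OCR candidates, nearest-neighbor ranking) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the word representation or the metric, so the character-bigram vectors, the cosine similarity, and the function names `char_ngrams`, `embed`, `cosine`, and `rank` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def char_ngrams(word, n=2):
    """Return the character n-grams of a word, padded with boundary markers."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def embed(word, n=2):
    """Map a word to a sparse, L2-normalized bag-of-n-grams vector.

    Assumption: character bigrams stand in for whatever word
    representation the paper actually uses.
    """
    counts = Counter(char_ngrams(word, n))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {gram: c / norm for gram, c in counts.items()}


def cosine(u, v):
    """Cosine similarity of two L2-normalized sparse vectors."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(weight * v.get(gram, 0.0) for gram, weight in u.items())


def rank(query, candidates, top_k=3):
    """Rank OCR candidate words by similarity to the query, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical OCR output in which "Berlin" was misread as "Berlln".
ocr_words = ["Berlln", "Bericht", "Befehl", "München"]
best_score, best_word = rank("Berlin", ocr_words)[0]
print(best_word, round(best_score, 3))
```

Because the misread word still shares most of its bigrams with the query, it ranks above the other candidates even though an exact string match would fail. A brute-force scan like this is linear in the number of candidates; a system aiming for fast search over a large corpus would presumably replace it with a nearest-neighbor index.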

Taxonomy

Topics: Handwritten Text Recognition Techniques · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques