CLIPTER: Looking at the Bigger Picture in Scene Text Recognition

Aviad Aberdam; David Bensaïd; Alona Golts; Roy Ganz; Oren Nuriel; Royee Tichauer; Shai Mazor; Ron Litman

arXiv:2301.07464·cs.CV·July 25, 2023

Open Access

TL;DR

CLIPTER enhances scene text recognition by integrating scene-level context from vision-language models like CLIP, leading to improved accuracy, robustness, and generalization across benchmarks.

Contribution

This work introduces a novel framework that fuses scene-level context with crop-based recognizers using gated cross-attention, improving performance and robustness.

Findings

01. Achieves state-of-the-art results on multiple benchmarks.

02. Improves robustness to out-of-vocabulary words.

03. Enhances generalization in low-data regimes.

Abstract

Reading text in real-world scenarios often requires understanding the context surrounding it, especially when dealing with poor-quality text. However, current scene text recognizers are unaware of the bigger picture as they operate on cropped text images. In this study, we harness the representative capabilities of modern vision-language models, such as CLIP, to provide scene-level information to the crop-based recognizer. We achieve this by fusing a rich representation of the entire image, obtained from the vision-language model, with the recognizer word-level features via a gated cross-attention mechanism. This component gradually shifts to the context-enhanced representation, allowing for stable fine-tuning of a pretrained recognizer. We demonstrate the effectiveness of our model-agnostic framework, CLIPTER (CLIP TExt Recognition), on leading text recognition architectures and…
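The gated cross-attention fusion described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single attention head, the random projection weights, the feature sizes (26 word-level tokens, 50 scene tokens, width 64), and the zero-initialized tanh gate are all assumptions chosen for illustration. The zero-initialized gate is one common way to get the "gradual shift" behavior the abstract mentions, since the layer starts as an identity over the recognizer features and only admits scene context as the gate parameter moves away from zero.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


class GatedCrossAttention:
    """Hypothetical single-head gated cross-attention layer (illustrative only)."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(dim)
        # Random projections stand in for learned weights.
        self.w_q = rng.normal(0.0, scale, (dim, dim))
        self.w_k = rng.normal(0.0, scale, (dim, dim))
        self.w_v = rng.normal(0.0, scale, (dim, dim))
        # Assumed gate parameter: tanh(0) = 0, so the layer starts as identity.
        self.alpha = 0.0

    def __call__(self, word_feats, scene_feats):
        # Queries come from the recognizer's word-level features,
        # keys and values from the scene-level (whole image) features.
        q = word_feats @ self.w_q
        k = scene_feats @ self.w_k
        v = scene_feats @ self.w_v
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
        context = attn @ v
        # Gated residual: the gate controls how much scene context is mixed in.
        return word_feats + np.tanh(self.alpha) * context


rng = np.random.default_rng(1)
word_feats = rng.normal(size=(26, 64))   # hypothetical recognizer features
scene_feats = rng.normal(size=(50, 64))  # hypothetical vision-language model tokens

layer = GatedCrossAttention(dim=64)
out_closed = layer(word_feats, scene_feats)  # gate closed: output equals input

layer.alpha = 1.0                            # gate partly open
out_open = layer(word_feats, scene_feats)    # scene context now contributes
```

With the gate closed, a pretrained recognizer would see exactly the features it was trained on, which is the property that makes fine-tuning stable in this kind of design; in a trainable version `alpha` would be a learned parameter rather than set by hand.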

Taxonomy

Topics: Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Natural Language Processing Techniques

Methods: Contrastive Language-Image Pre-training