CLIPTER: Looking at the Bigger Picture in Scene Text Recognition
Aviad Aberdam, David Bensaïd, Alona Golts, Roy Ganz, Oren Nuriel, Royee Tichauer, Shai Mazor, Ron Litman

TL;DR
CLIPTER enhances scene text recognition by integrating scene-level context from vision-language models like CLIP, leading to improved accuracy, robustness, and generalization across benchmarks.
Contribution
This work introduces a novel framework that fuses scene-level context with crop-based recognizers using gated cross-attention, improving performance and robustness.
Findings
Achieves state-of-the-art results on multiple benchmarks.
Improves robustness to out-of-vocabulary words.
Enhances generalization in low-data regimes.
Abstract
Reading text in real-world scenarios often requires understanding the context surrounding it, especially when dealing with poor-quality text. However, current scene text recognizers are unaware of the bigger picture as they operate on cropped text images. In this study, we harness the representative capabilities of modern vision-language models, such as CLIP, to provide scene-level information to the crop-based recognizer. We achieve this by fusing a rich representation of the entire image, obtained from the vision-language model, with the recognizer word-level features via a gated cross-attention mechanism. This component gradually shifts to the context-enhanced representation, allowing for stable fine-tuning of a pretrained recognizer. We demonstrate the effectiveness of our model-agnostic framework, CLIPTER (CLIP TExt Recognition), on leading text recognition architectures and…
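The gated fusion described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a single attention head, a scalar tanh gate initialized at zero (a common choice for gated cross-attention; the abstract does not specify the gate's form), and random projection matrices. The names `gated_cross_attention`, `w_q`, `w_k`, `w_v`, and `alpha` are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gated_cross_attention(word_feats, scene_feats, w_q, w_k, w_v, alpha):
    """Fuse scene-level features into word-level features.

    word_feats:  (T, d) features from the crop-based recognizer (queries)
    scene_feats: (S, d) features of the whole image (keys and values)
    alpha:       scalar gate parameter; tanh(alpha) scales the context term
                 (assumed gate form, not confirmed by the abstract)
    """
    q = word_feats @ w_q
    k = scene_feats @ w_k
    v = scene_feats @ w_v
    # Each word-level token attends over all scene-level tokens.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    context = attn @ v
    # With alpha = 0 the gate is closed and the recognizer features pass
    # through unchanged; as alpha moves away from 0, context is mixed in.
    return word_feats + np.tanh(alpha) * context


# Toy inputs: 5 word-level tokens, 10 scene-level tokens, width 8.
rng = np.random.default_rng(0)
d = 8
word_feats = rng.normal(size=(5, d))
scene_feats = rng.normal(size=(10, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

out_closed = gated_cross_attention(word_feats, scene_feats, w_q, w_k, w_v, alpha=0.0)
out_open = gated_cross_attention(word_feats, scene_feats, w_q, w_k, w_v, alpha=1.0)
```

Under this assumed gate, starting at `alpha = 0` leaves the pretrained recognizer's features untouched, which is one way to obtain the gradual shift and stable fine-tuning the abstract mentions.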
Taxonomy
Topics: Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Natural Language Processing Techniques
Methods: Contrastive Language-Image Pre-training
