CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model
Shuai Zhao, Ruijie Quan, Linchao Zhu, Yi Yang

TL;DR
CLIP4STR adapts the pre-trained vision-language model CLIP into a simple scene text recognition (STR) method that achieves state-of-the-art results on 13 STR benchmarks.
Contribution
The paper introduces CLIP4STR, a scene text recognition framework built on CLIP's image and text encoders. It has two encoder-decoder branches, a visual branch and a cross-modal branch, and decodes with a predict-and-refine scheme.
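The predict-and-refine flow can be illustrated with a minimal control-flow sketch: a visual branch produces an initial prediction, and a cross-modal branch refines it using both the visual feature and the encoded prediction. Everything named here (`predict_and_refine`, the `toy_*` stubs, `TOY_VOCAB`) is hypothetical and for illustration only; the stubs are not CLIP's encoders or the paper's decoders, and the vocabulary lookup is a stand-in for a learned refinement step.

```python
from typing import Callable, List, Sequence, Tuple

Feature = Sequence[float]


def predict_and_refine(
    image: Feature,
    encode_image: Callable[[Feature], Feature],
    visual_decode: Callable[[Feature], str],
    encode_text: Callable[[str], Feature],
    cross_modal_decode: Callable[[Feature, Feature], str],
) -> Tuple[str, str]:
    """Return (initial, refined) predictions from two branches."""
    # Visual branch: initial prediction from the visual feature alone.
    visual_feature = encode_image(image)
    initial = visual_decode(visual_feature)
    # Cross-modal branch: refine using visual feature plus text semantics.
    text_feature = encode_text(initial)
    refined = cross_modal_decode(visual_feature, text_feature)
    return initial, refined


# --- Toy stand-ins (assumptions, NOT the paper's models) ---
TOY_VOCAB = ["clip", "chip"]


def toy_encode_image(image: Feature) -> List[float]:
    # Pretend the "image" is already a list of noisy character codes.
    return list(image)


def toy_visual_decode(visual_feature: Feature) -> str:
    # Round each noisy code to the nearest character.
    return "".join(chr(round(v)) for v in visual_feature)


def toy_encode_text(text: str) -> List[float]:
    return [float(ord(ch)) for ch in text]


def toy_cross_modal_decode(visual_feature: Feature, text_feature: Feature) -> str:
    # Pick the vocabulary word that best agrees with both modalities.
    def cost(word: str) -> float:
        codes = [ord(ch) for ch in word]
        visual_cost = sum(abs(c - v) for c, v in zip(codes, visual_feature))
        text_cost = sum(abs(c - t) for c, t in zip(codes, text_feature))
        return visual_cost + text_cost

    candidates = [w for w in TOY_VOCAB if len(w) == len(text_feature)]
    return min(candidates, key=cost)


# Noisy codes for "clip": the second value rounds to "k" instead of "l".
noisy_image = [99.0, 107.4, 105.0, 112.0]
initial, refined = predict_and_refine(
    noisy_image,
    toy_encode_image,
    toy_visual_decode,
    toy_encode_text,
    toy_cross_modal_decode,
)
print(initial, refined)
```

In this toy run the visual stub misreads one character and the cross-modal stub corrects it, which mirrors the division of labor between the two branches without claiming anything about how the actual decoders are implemented.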
Findings
Achieves state-of-the-art results on 13 STR benchmarks.
Demonstrates the effectiveness of using VLMs for scene text recognition.
Provides a comprehensive empirical study on CLIP adaptation for STR.
Abstract
Pre-trained vision-language models (VLMs) are the de-facto foundation models for various downstream tasks. However, scene text recognition methods still prefer backbones pre-trained on a single modality, namely, the visual modality, despite the potential of VLMs to serve as powerful scene text readers. For example, CLIP can robustly identify regular (horizontal) and irregular (rotated, curved, blurred, or occluded) text in images. With such merits, we transform CLIP into a scene text reader and introduce CLIP4STR, a simple yet effective STR method built upon image and text encoders of CLIP. It has two encoder-decoder branches: a visual branch and a cross-modal branch. The visual branch provides an initial prediction based on the visual feature, and the cross-modal branch refines this prediction by addressing the discrepancy between the visual feature and text semantics. To fully…
Taxonomy
Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Image Retrieval and Classification Techniques
Methods: Contrastive Language-Image Pre-training
