CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model

Shuai Zhao; Ruijie Quan; Linchao Zhu; Yi Yang

arXiv:2305.14014 · cs.CV · December 25, 2024 · 6 cites

TL;DR

CLIP4STR leverages pre-trained vision-language models to create a simple, effective scene text recognition method that outperforms existing approaches across multiple benchmarks.

Contribution

The paper introduces CLIP4STR, a novel scene text recognition framework built on CLIP's image and text encoders, with a dual encoder-decoder architecture and a predict-and-refine decoding scheme.

Findings

01. Achieves state-of-the-art results on 13 STR benchmarks.

02. Demonstrates the effectiveness of using VLMs for scene text recognition.

03. Provides a comprehensive empirical study on CLIP adaptation for STR.

Abstract

Pre-trained vision-language models (VLMs) are the de-facto foundation models for various downstream tasks. However, scene text recognition methods still prefer backbones pre-trained on a single modality, namely, the visual modality, despite the potential of VLMs to serve as powerful scene text readers. For example, CLIP can robustly identify regular (horizontal) and irregular (rotated, curved, blurred, or occluded) text in images. With such merits, we transform CLIP into a scene text reader and introduce CLIP4STR, a simple yet effective STR method built upon image and text encoders of CLIP. It has two encoder-decoder branches: a visual branch and a cross-modal branch. The visual branch provides an initial prediction based on the visual feature, and the cross-modal branch refines this prediction by addressing the discrepancy between the visual feature and text semantics. To fully…
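The two-branch flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: every function name below is hypothetical, the "encoders" and "decoders" are toy string functions rather than CLIP modules, and the lexicon lookup is only a stand-in for whatever the cross-modal branch learns. The sketch shows the control flow only: a visual prediction first, then a refinement that sees both the visual feature and the encoded initial prediction.

```python
from typing import Callable, Tuple


def predict_and_refine(
    image: str,
    image_encoder: Callable[[str], str],
    text_encoder: Callable[[str], str],
    visual_decoder: Callable[[str], str],
    cross_modal_decoder: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Return (initial, refined) predictions. All components are assumptions."""
    # Visual branch: encode the image and decode an initial prediction.
    visual_feature = image_encoder(image)
    initial = visual_decoder(visual_feature)
    # Cross-modal branch: encode the initial prediction as text, then
    # refine it using both the visual feature and the text feature.
    text_feature = text_encoder(initial)
    refined = cross_modal_decoder(visual_feature, text_feature)
    return initial, refined


# Toy stand-ins (hypothetical, for illustration only).
LEXICON = {"c0ffee": "coffee"}  # pretend "text semantics"


def toy_image_encoder(image: str) -> str:
    # Identity: the "image" is just a string in this toy setup.
    return image


def toy_visual_decoder(visual_feature: str) -> str:
    # Simulate a purely visual confusion between 'o' and '0'.
    return visual_feature.replace("o", "0")


def toy_text_encoder(text: str) -> str:
    # Normalise case as a trivial "text feature".
    return text.lower()


def toy_cross_modal_decoder(visual_feature: str, text_feature: str) -> str:
    # Ignores the visual feature here; a real decoder would attend to both.
    return LEXICON.get(text_feature, text_feature)


if __name__ == "__main__":
    initial, refined = predict_and_refine(
        "coffee",
        toy_image_encoder,
        toy_text_encoder,
        toy_visual_decoder,
        toy_cross_modal_decoder,
    )
    print(initial, refined)
```

In this toy run the visual branch misreads the word and the second pass corrects it, which is the kind of discrepancy between visual feature and text semantics that the abstract says the cross-modal branch addresses.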

Code & Models

Repositories

VamosC/CLIP4STR
pytorch

Models

mzhaoshuai/CLIP4STR
Hugging Face model · ♡ 3

Taxonomy

Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Image Retrieval and Classification Techniques

Methods: Contrastive Language-Image Pre-training