Turning a CLIP Model into a Scene Text Spotter

Wenwen Yu; Yuliang Liu; Xingkui Zhu; Haoyu Cao; Xing Sun; Xiang Bai

arXiv:2308.10408·cs.CV·August 22, 2023

TL;DR

This paper transforms the CLIP model into a powerful backbone for scene text detection and spotting, leveraging visual prompts and cross-attention to improve accuracy, speed, and robustness, especially in few-shot and out-of-distribution scenarios.

Contribution

The paper introduces FastTCM-CR50, a CLIP-based backbone that enhances text detection and spotting with visual prompts, cross-attention, and dynamic language prompts, and that outperforms the earlier TCM-CR50 backbone.

Findings

01. Improves existing text detectors by an average of 1.7% and text spotters by an average of 1.5%.
02. Outperforms the previous TCM-CR50 backbone, with average gains of 0.2% and 0.56%.
03. Achieves 48.5% faster inference and strong few-shot performance.

Abstract

We exploit the potential of the large-scale Contrastive Language-Image Pretraining (CLIP) model to enhance scene text detection and spotting tasks, transforming it into a robust backbone, FastTCM-CR50. This backbone utilizes visual prompt learning and cross-attention in CLIP to extract image and text-based prior knowledge. Using predefined and learnable prompts, FastTCM-CR50 introduces an instance-language matching process to enhance the synergy between image and text embeddings, thereby refining text regions. Our Bimodal Similarity Matching (BSM) module facilitates dynamic language prompt generation, enabling offline computations and improving performance. FastTCM-CR50 offers several advantages: 1) It can enhance existing text detectors and spotters, improving performance by an average of 1.7% and 1.5%, respectively. 2) It outperforms the previous TCM-CR50 backbone, yielding an average…
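The instance-language matching idea mentioned in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the function names, the toy embedding values, and the temperature of 0.07 are all assumptions chosen for illustration. The sketch scores each pixel embedding against a single text embedding with a scaled cosine similarity passed through a sigmoid, which gives a coarse map of likely text regions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def matching_score_map(pixel_embeddings, text_embedding, temperature=0.07):
    """Hypothetical helper: per-pixel sigmoid of temperature-scaled cosine similarity.

    pixel_embeddings is an H x W grid of embedding vectors and text_embedding
    is one vector of the same dimension. The temperature is an assumed value.
    """
    t = l2_normalize(text_embedding)
    score_map = []
    for row in pixel_embeddings:
        score_row = []
        for pixel in row:
            p = l2_normalize(pixel)
            cos = sum(a * b for a, b in zip(p, t))
            score_row.append(1.0 / (1.0 + math.exp(-cos / temperature)))
        score_map.append(score_row)
    return score_map


# Toy 2 x 2 feature map with 3-dimensional embeddings (made-up values).
text_embedding = [1.0, 0.0, 0.0]
pixel_embeddings = [
    [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0]],
    [[-1.0, 0.0, 0.0], [0.5, 0.5, 0.0]],
]
scores = matching_score_map(pixel_embeddings, text_embedding)
for score_row in scores:
    print([round(s, 3) for s in score_row])
```

Pixels whose embeddings align with the text embedding score close to 1, orthogonal ones score 0.5, and opposed ones score close to 0. In a setup like this, a text embedding that does not depend on the input image could be computed once and reused across images, which is one plausible reading of the "offline computations" the abstract mentions.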

Code & Models

Repositories

wenwenyu/tcm (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning

Methods: Contrastive Language-Image Pre-training