Turning a CLIP Model into a Scene Text Spotter
Wenwen Yu, Yuliang Liu, Xingkui Zhu, Haoyu Cao, Xing Sun, Xiang Bai

TL;DR
This paper transforms the CLIP model into a powerful backbone for scene text detection and spotting, leveraging visual prompts and cross-attention to improve accuracy, speed, and robustness, especially in few-shot and out-of-distribution scenarios.
Contribution
It introduces FastTCM-CR50, a novel CLIP-based backbone that enhances text detection and spotting with visual prompts, cross-attention, and dynamic language prompts, outperforming previous methods.
Findings
Improves existing text detectors by an average of 1.7% and text spotters by an average of 1.5%.
Outperforms the previous TCM-CR50 backbone, with average gains of 0.2% and 0.56%.
Achieves 48.5% faster inference and strong few-shot performance.
Abstract
We exploit the potential of the large-scale Contrastive Language-Image Pretraining (CLIP) model to enhance scene text detection and spotting tasks, transforming it into a robust backbone, FastTCM-CR50. This backbone utilizes visual prompt learning and cross-attention in CLIP to extract image and text-based prior knowledge. Using predefined and learnable prompts, FastTCM-CR50 introduces an instance-language matching process to enhance the synergy between image and text embeddings, thereby refining text regions. Our Bimodal Similarity Matching (BSM) module facilitates dynamic language prompt generation, enabling offline computations and improving performance. FastTCM-CR50 offers several advantages: 1) It can enhance existing text detectors and spotters, improving performance by an average of 1.7% and 1.5%, respectively. 2) It outperforms the previous TCM-CR50 backbone, yielding an average…
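The instance-language matching step described in the abstract can be pictured as scoring every spatial location of the image embedding against a text (prompt) embedding. The snippet below is only a minimal sketch of that idea, not the authors' implementation: the function names, the tensor shapes, the temperature value, and the use of a sigmoid over scaled cosine similarity are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def instance_language_matching(pixel_emb, text_emb, temperature=0.07):
    """Hypothetical matching step (illustrative, not the paper's code).

    pixel_emb: (H, W, C) per-location image embeddings.
    text_emb:  (K, C) prompt embeddings, one row per text instance.
    Returns a (K, H, W) array of matching scores in (0, 1).
    """
    p = l2_normalize(pixel_emb)
    t = l2_normalize(text_emb)
    # Cosine similarity between every location and every prompt embedding.
    logits = np.einsum("hwc,kc->khw", p, t) / temperature
    # Sigmoid squashes the scaled similarity into a per-location score.
    return 1.0 / (1.0 + np.exp(-logits))


# Toy data: random embeddings with one location aligned to the prompt.
rng = np.random.default_rng(0)
pixel_emb = rng.normal(size=(4, 6, 8))
text_emb = rng.normal(size=(1, 8))
pixel_emb[1, 2] = 5.0 * text_emb[0]  # this location "looks like text"

scores = instance_language_matching(pixel_emb, text_emb)
peak = tuple(int(i) for i in np.unravel_index(scores[0].argmax(), scores[0].shape))
```

In this toy setup the highest score lands on the location whose embedding was aligned with the prompt, which is the sense in which such a score map could help refine candidate text regions.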
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning
Methods: Contrastive Language-Image Pre-training
