Focus, Distinguish, and Prompt: Unleashing CLIP for Efficient and Flexible Scene Text Retrieval
Gangyan Zeng, Yuan Zhang, Jin Wei, Dongbao Yang, Peng Zhang, Yiwen Gao, Xugong Qin, Yu Zhou

TL;DR
This paper introduces FDP, a CLIP-based model for efficient, OCR-free scene text retrieval that improves speed and accuracy by focusing attention on text regions and treating content words and function words in the query differently.
Contribution
The paper proposes FDP, a model that improves CLIP's scene text retrieval by addressing its limited text perceptual scale and its entangled visual-semantic concepts through attention focusing and semantic prompting.
Findings
FDP surpasses the previous state of the art by 4.37% on the IIIT-STR benchmark.
FDP runs inference 4x faster than existing methods.
FDP performs well in phrase-level and attribute-aware retrieval settings.
Abstract
Scene text retrieval aims to find all images containing the query text from an image gallery. Current efforts tend to adopt an Optical Character Recognition (OCR) pipeline, which requires complicated text detection and/or recognition processes, resulting in inefficient and inflexible retrieval. Different from them, in this work we propose to explore the intrinsic potential of Contrastive Language-Image Pre-training (CLIP) for OCR-free scene text retrieval. Through empirical analysis, we observe that the main challenges of CLIP as a text retriever are: 1) limited text perceptual scale, and 2) entangled visual-semantic concepts. To this end, a novel model termed FDP (Focus, Distinguish, and Prompt) is developed. FDP first focuses on scene text via shifting the attention to the text area and probing the hidden text knowledge, and then divides the query text into content word and function…
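The retrieval task described in the abstract (find the gallery images that contain the query text) can be pictured as ranking images by the similarity between a query embedding and each image embedding. The sketch below is a generic illustration of that CLIP-style ranking step, not FDP's actual method: the function names `cosine` and `rank_gallery`, the two-dimensional toy vectors, and the image IDs are all assumptions made for the example, and a real system would obtain the embeddings from CLIP's text and image encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_gallery(query_emb, gallery):
    """Return (image_id, score) pairs sorted from most to least similar.

    `gallery` maps a hypothetical image ID to a precomputed embedding.
    """
    scored = [(image_id, cosine(query_emb, emb)) for image_id, emb in gallery.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings standing in for encoder outputs (assumed values).
gallery = {
    "img_a": [1.0, 0.0],
    "img_b": [0.6, 0.8],
    "img_c": [0.0, 1.0],
}
ranking = rank_gallery([1.0, 0.0], gallery)
for image_id, score in ranking:
    print(image_id, round(score, 3))
```

Because the image embeddings can be computed once and cached, each new query costs only one text encoding plus a similarity pass over the gallery, which is one reason an OCR-free embedding approach can be faster than running text detection and recognition at query time.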
Taxonomy
Topics: Image Retrieval and Classification Techniques · Topic Modeling · Text and Document Classification Technologies
Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
