Focus, Distinguish, and Prompt: Unleashing CLIP for Efficient and Flexible Scene Text Retrieval

Gangyan Zeng; Yuan Zhang; Jin Wei; Dongbao Yang; Peng Zhang; Yiwen Gao; Xugong Qin; Yu Zhou

arXiv:2408.00441·cs.CV·August 2, 2024

TL;DR

This paper introduces FDP, a novel CLIP-based model for efficient, OCR-free scene text retrieval that improves speed and accuracy by focusing on text regions and processing query content differently.

Contribution

The paper proposes FDP, a model that improves CLIP's scene text retrieval by addressing two challenges, limited text perceptual scale and entangled visual-semantic concepts, through attention focusing and semantic prompting.

Findings

01. FDP surpasses the state of the art by 4.37% on the IIIT-STR benchmark.

02. FDP runs inference 4x faster than existing methods.

03. FDP performs well in phrase-level and attribute-aware retrieval settings.

Abstract

Scene text retrieval aims to find all images containing the query text from an image gallery. Current efforts tend to adopt an Optical Character Recognition (OCR) pipeline, which requires complicated text detection and/or recognition processes, resulting in inefficient and inflexible retrieval. Different from them, in this work we propose to explore the intrinsic potential of Contrastive Language-Image Pre-training (CLIP) for OCR-free scene text retrieval. Through empirical analysis, we observe that the main challenges of CLIP as a text retriever are: 1) limited text perceptual scale, and 2) entangled visual-semantic concepts. To this end, a novel model termed FDP (Focus, Distinguish, and Prompt) is developed. FDP first focuses on scene text via shifting the attention to the text area and probing the hidden text knowledge, and then divides the query text into content word and function…
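The retrieval task described in the abstract can be illustrated with a minimal sketch: given one embedding for the query text and one per gallery image, rank the images by cosine similarity. This is a generic CLIP-style ranking step, not FDP itself. The vectors below are hand-written toy values standing in for encoder outputs, and the names `cosine`, `rank_gallery`, and the image file names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_gallery(query_vec, gallery):
    """Return (image_name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in gallery.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy embeddings (assumption): a real system would obtain these from
# a text encoder and an image encoder such as CLIP's.
query_vec = [1.0, 0.0, 0.0]
gallery = {
    "img_a.jpg": [0.9, 0.1, 0.0],
    "img_b.jpg": [0.0, 1.0, 0.0],
    "img_c.jpg": [0.5, 0.5, 0.0],
}

ranking = rank_gallery(query_vec, gallery)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because the query is compared against precomputed image embeddings, no text detection or recognition step runs at query time, which is the sense in which such a pipeline is OCR-free.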

Code & Models

Repositories

gyann-z/fdp (PyTorch, official)

Taxonomy

Topics: Image Retrieval and Classification Techniques · Topic Modeling · Text and Document Classification Technologies

Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings