TextRefiner: Internal Visual Feature as Efficient Refiner for Vision-Language Models Prompt Tuning

Jingjing Xie; Yuxin Zhang; Jun Peng; Zhaohong Huang; Liujuan Cao

arXiv:2412.08176·cs.CV·December 12, 2024

TL;DR

TextRefiner enhances vision-language prompt tuning by internally refining text prompts using local visual features, leading to improved performance without external knowledge or high inference costs.

Contribution

It introduces a local cache module that leverages internal visual tokens to refine text prompts, surpassing existing methods in efficiency and accuracy.

Findings

01 Improves CoOp performance from 71.66% to 76.94% on 11 benchmarks.

02 Outperforms CoCoOp by integrating instance-wise features.

03 Achieves state-of-the-art results with efficient inference.
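To put the first finding in perspective, the short calculation below works out the absolute and relative gain implied by the two reported CoOp numbers. The variable names are illustrative, and the page does not state which metric the percentages refer to, so the result is only a restatement of the reported figures.

```python
# Reported CoOp scores before and after adding TextRefiner (from the findings above).
coop_baseline = 71.66
coop_refined = 76.94

# Absolute gain in percentage points and gain relative to the baseline.
absolute_gain = coop_refined - coop_baseline
relative_gain = absolute_gain / coop_baseline * 100

print(f"absolute gain: {absolute_gain:.2f} points")  # absolute gain: 5.28 points
print(f"relative gain: {relative_gain:.2f}%")        # relative gain: 7.37%
```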

Abstract

Despite the efficiency of prompt learning in transferring vision-language models (VLMs) to downstream tasks, existing methods mainly learn the prompts in a coarse-grained manner, where the learned prompt vectors are shared across all categories. Consequently, the tailored prompts often fail to discern class-specific visual concepts, hindering the transferred performance for classes that share similar or complex visual attributes. Recent advances mitigate this challenge by leveraging external knowledge from Large Language Models (LLMs) to furnish class descriptions, but this incurs notable inference costs. In this paper, we introduce TextRefiner, a plug-and-play method to refine the text prompts of existing methods by leveraging the internal knowledge of VLMs. In particular, TextRefiner builds a novel local cache module to encapsulate fine-grained visual concepts derived from local…
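The abstract is cut off before it explains how the local cache works, so the following is only a rough sketch of what such a module might look like, not the authors' implementation. Everything in it is an assumption made for illustration: the `LocalCache` class, the nearest-entry momentum update, the softmax aggregation, and the `momentum`, `temperature`, and `alpha` parameters. Random arrays stand in for the local visual tokens and text features that a VLM such as CLIP would produce.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class LocalCache:
    """Hypothetical cache of fine-grained visual concepts (illustrative only)."""

    def __init__(self, num_entries, dim, momentum=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.entries = l2_normalize(rng.standard_normal((num_entries, dim)))
        self.momentum = momentum

    def update(self, local_tokens):
        """Assumed update rule: move each entry toward the mean of the
        local tokens that are most similar to it (cosine similarity)."""
        tokens = l2_normalize(local_tokens)
        nearest = (tokens @ self.entries.T).argmax(axis=1)
        for k in np.unique(nearest):
            mean_token = tokens[nearest == k].mean(axis=0)
            self.entries[k] = (
                self.momentum * self.entries[k]
                + (1.0 - self.momentum) * mean_token
            )
        self.entries = l2_normalize(self.entries)

    def refine(self, text_features, temperature=0.1, alpha=0.5):
        """Assumed refinement: each class text feature attends over the
        cache and adds the weighted summary of entries to itself."""
        text = l2_normalize(text_features)
        weights = softmax(text @ self.entries.T / temperature, axis=1)
        aggregated = weights @ self.entries
        return l2_normalize(text + alpha * aggregated)


# Stand-in data: 49 local tokens (e.g. a 7x7 patch grid) and 5 class prompts.
rng = np.random.default_rng(1)
cache = LocalCache(num_entries=8, dim=16)
cache.update(rng.standard_normal((49, 16)))
refined = cache.refine(rng.standard_normal((5, 16)))
print(refined.shape)  # (5, 16)
```

In this sketch the cache is only written to in `update`, which would run during training. At inference, `refine` reads a fixed cache and costs a couple of small matrix products with no call to an external model, which is one plausible reading of the abstract's contrast with LLM-based class descriptions.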

Code & Models

Repositories

xjjxmu/textrefiner (PyTorch, official)

Videos

TextRefiner: Internal Visual Feature as Efficient Refiner for Vision-Language Models Prompt Tuning· underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques

Methods: Context Optimization