TextRefiner: Internal Visual Feature as Efficient Refiner for Vision-Language Models Prompt Tuning
Jingjing Xie, Yuxin Zhang, Jun Peng, Zhaohong Huang, Liujuan Cao

TL;DR
TextRefiner enhances vision-language prompt tuning by internally refining text prompts using local visual features, leading to improved performance without external knowledge or high inference costs.
Contribution
It introduces a local cache module that leverages internal visual tokens to refine text prompts, surpassing existing methods in efficiency and accuracy.
Findings
Improves CoOp performance from 71.66% to 76.94% on 11 benchmarks.
Outperforms CoCoOp, which introduces instance-wise features for each image.
Achieves state-of-the-art results with efficient inference.
Abstract
Despite the efficiency of prompt learning in transferring vision-language models (VLMs) to downstream tasks, existing methods mainly learn the prompts in a coarse-grained manner where the learned prompt vectors are shared across all categories. Consequently, the tailored prompts often fail to discern class-specific visual concepts, thereby hindering the transferred performance for classes that share similar or complex visual attributes. Recent advances mitigate this challenge by leveraging external knowledge from Large Language Models (LLMs) to furnish class descriptions, yet incurring notable inference costs. In this paper, we introduce TextRefiner, a plug-and-play method to refine the text prompts of existing methods by leveraging the internal knowledge of VLMs. Particularly, TextRefiner builds a novel local cache module to encapsulate fine-grained visual concepts derived from local…
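The abstract is cut off before it describes the local cache in detail, so the following is only an illustrative sketch of the general idea (a cache built from local visual tokens that is then used to refine a text feature), not the authors' implementation. The class name `LocalCache`, the nearest-entry assignment, the momentum update, the softmax readout, and the residual weight `alpha` are all assumptions made for this example.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 norm (a zero vector is returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class LocalCache:
    """Hypothetical cache of prototype vectors built from local (patch) tokens."""

    def __init__(self, num_entries, dim, momentum=0.9, seed=0):
        rng = random.Random(seed)
        self.momentum = momentum  # assumed update rate, not taken from the paper
        self.entries = [
            normalize([rng.gauss(0, 1) for _ in range(dim)])
            for _ in range(num_entries)
        ]

    def update(self, local_tokens):
        # Assign each local token to its most similar entry, then move that
        # entry toward the token with an exponential moving average.
        for tok in local_tokens:
            tok = normalize(tok)
            k = max(range(len(self.entries)),
                    key=lambda i: dot(self.entries[i], tok))
            mixed = [self.momentum * e + (1 - self.momentum) * t
                     for e, t in zip(self.entries[k], tok)]
            self.entries[k] = normalize(mixed)

    def refine(self, text_feature, alpha=0.5):
        # Use the text feature as a query over the cache entries and add the
        # retrieved vector back as a residual (alpha is an assumed weight).
        text_feature = normalize(text_feature)
        weights = softmax([dot(text_feature, e) for e in self.entries])
        retrieved = [
            sum(w * e[d] for w, e in zip(weights, self.entries))
            for d in range(len(text_feature))
        ]
        return normalize([t + alpha * r
                          for t, r in zip(text_feature, retrieved)])


# Toy usage with made-up 8-dimensional features.
cache = LocalCache(num_entries=4, dim=8)
cache.update([[1.0] * 8, [0.5, -0.5] * 4])
refined = cache.refine([1.0, 0.0] * 4)
```

In this sketch the refinement needs only the cached vectors at inference time and no call to an external language model, which is the efficiency argument the abstract makes; how the actual method builds and reads its cache is not recoverable from the truncated text.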
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
Methods: Context Optimization
