GA2-CLIP: Generic Attribute Anchor for Efficient Prompt Tuningin Video-Language Models
Bin Wang, Ruotong Hu, Wentong Li, Wenqian Wang, Mingliang Gao, Runmin Cong, Wei Zhang, Xudong Jiang

TL;DR
GA2-CLIP introduces a novel prompt tuning framework for video-language models that enhances generalization by using external prompts and attribute anchors, outperforming existing methods.
Contribution
The paper proposes a plug-and-play coupling prompt learning method with attribute anchors to improve VLM generalization in video tasks, addressing semantic space narrowing.
Findings
Significantly outperforms state-of-the-art prompt tuning methods.
Improves base-to-new class prediction accuracy.
Effectively maintains attribute relevance and model generalization.
Abstract
Visual and textual soft prompt tuning can effectively improve the adaptability of Vision-Language Models (VLMs) in downstream tasks. However, fine-tuning on video tasks impairs the model's generalization ability to unseen classes. Existing methods attempt to mitigate this forgetting effect by regularizing the gap between hand-crafted prompts and soft prompts, but this also weakens the learning ability of soft prompts. To address this challenge, we propose a plug-and-play coupling prompt learning framework to optimize the generalization performance of V-L models in video tasks, with the core motivation of mitigating semantic space narrowing during fine-tuning by introducing an externally supervised prompt. Specifically, for textual prompts, we introduce pre-trained prompts from other datasets as hard prompt tokens. These are concatenated with soft prompt tokens and coupled via a…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
