One Last Attention for Your Vision-Language Model
Liang Chen, Ghazi Shazan Ahmad, Tianjun Yao, Lingqiao Liu, Zhiqiang Shen

TL;DR
This paper introduces RAda, a simple fine-tuning method for vision-language models that dynamically adjusts the contribution of fused representations during adaptation, enhancing performance with minimal modifications.
Contribution
The paper proposes RAda, a novel fine-tuning approach that explicitly leverages the fused representation in VLMs using a learned mask, improving adaptation effectiveness.
Findings
RAda improves zero-shot and fine-tuned performance across various settings.
It requires minimal code changes and computational overhead.
RAda performs comparably or better than existing methods in experiments.
Abstract
Pretrained vision-language models (VLMs), such as CLIP, achieve remarkable zero-shot performance, yet their downstream potential hinges on effective fine-tuning. Most adaptation methods typically focus on refining representation from separate modalities (text or vision) but neglect the critical role of their fused representations in the decision-making process, \emph{\ie} rational matrix that drives the final prediction. To bridge the gap, we propose a simple yet effective \textbf{R}ational \textbf{Ada}ptaion ({RAda}) to explicitly exploit the final fused representation during fine-tuning. RAda employs a learned mask, obtained from a lightweight attention layer attached at the end of a VLM, to dynamically calibrate the contribution of each element in the rational matrix, enabling targeted adjustments to the final cross-modal interactions without incurring costly modifications to…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
