One Last Attention for Your Vision-Language Model

Liang Chen; Ghazi Shazan Ahmad; Tianjun Yao; Lingqiao Liu; Zhiqiang Shen

arXiv:2507.15480 · cs.CV · July 29, 2025

Open Access

TL;DR

This paper introduces RAda, a simple fine-tuning method for vision-language models that dynamically adjusts the contribution of fused representations during adaptation, enhancing performance with minimal modifications.

Contribution

The paper proposes RAda, a novel fine-tuning approach that explicitly leverages the fused representation in VLMs using a learned mask, improving adaptation effectiveness.

Findings

01. RAda improves zero-shot and fine-tuned performance across various settings.

02. It requires minimal code changes and computational overhead.

03. RAda performs comparably or better than existing methods in experiments.

Abstract

Pretrained vision-language models (VLMs), such as CLIP, achieve remarkable zero-shot performance, yet their downstream potential hinges on effective fine-tuning. Most adaptation methods focus on refining representations from the separate modalities (text or vision) but neglect the critical role of their fused representations in the decision-making process, i.e., the rational matrix that drives the final prediction. To bridge the gap, we propose a simple yet effective Rational Adaptation (RAda) to explicitly exploit the final fused representation during fine-tuning. RAda employs a learned mask, obtained from a lightweight attention layer attached at the end of a VLM, to dynamically calibrate the contribution of each element in the rational matrix, enabling targeted adjustments to the final cross-modal interactions without incurring costly modifications to…
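The masking idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not define the rational matrix or the attention layer in detail, so the code assumes the rational matrix is the element-wise product of an image feature with each class text embedding (so that each row sums to the usual similarity logit), and it uses a hypothetical single-head self-attention over the rows followed by a sigmoid to produce the mask. All function names, shapes, and the random weights `Wq`, `Wk`, `Wv` are illustrative assumptions.

```python
import math
import random


def rational_matrix(image_feat, text_feats):
    """Assumed definition: R[k][d] = image_feat[d] * text_feats[k][d].

    Summing row k recovers the plain dot-product logit for class k.
    """
    return [[f * w for f, w in zip(image_feat, row)] for row in text_feats]


def matmul(a, b):
    """Plain-Python matrix product of a (n x m) and b (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_mask(R, Wq, Wk, Wv):
    """Hypothetical lightweight attention: self-attention over the rows of R,
    squashed by a sigmoid so every mask entry lies in (0, 1)."""
    Q, K, V = matmul(R, Wq), matmul(R, Wk), matmul(R, Wv)
    scale = math.sqrt(len(Q[0]))
    K_t = [list(col) for col in zip(*K)]          # transpose of K
    scores = matmul(Q, K_t)                       # class-by-class scores
    attn = [softmax([s / scale for s in row]) for row in scores]
    out = matmul(attn, V)                         # same shape as R
    return [[1.0 / (1.0 + math.exp(-x)) for x in row] for row in out]


def masked_logits(R, M):
    """Logit for class k is the sum over d of M[k][d] * R[k][d]."""
    return [sum(m * r for m, r in zip(m_row, r_row)) for m_row, r_row in zip(M, R)]


# Toy example with random features (stand-ins for frozen VLM outputs).
random.seed(0)
D, K = 4, 3  # feature dimension, number of classes
image_feat = [random.gauss(0, 1) for _ in range(D)]
text_feats = [[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]

R = rational_matrix(image_feat, text_feats)
Wq, Wk, Wv = (
    [[random.gauss(0, 0.1) for _ in range(D)] for _ in range(D)] for _ in range(3)
)
M = attention_mask(R, Wq, Wk, Wv)
logits = masked_logits(R, M)
print(logits)
```

In this sketch only `Wq`, `Wk`, and `Wv` would be trained, which is one way to read the abstract's claim that the adjustment happens at the final cross-modal interaction rather than inside the encoders. With an all-ones mask, `masked_logits` reduces to the ordinary dot-product logits.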

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis