RMAdapter: Reconstruction-based Multi-Modal Adapter for Vision-Language Models
Xiang Lin, Weixin Li, Shu Guo, Lihong Wang, Di Huang

TL;DR
RMAdapter introduces a dual-branch, reconstruction-based approach to fine-tuning vision-language models that balances task-specific adaptation against preservation of general knowledge, improving performance across various generalization tasks.
Contribution
The paper proposes a novel lightweight dual-branch RMAdapter that combines adaptation and reconstruction to enhance fine-tuning of VLMs in few-shot scenarios.
Findings
Outperforms state-of-the-art methods on multiple generalization tasks
Effectively balances task-specific knowledge and general knowledge
Maintains low computational overhead despite additional reconstruction branch
Abstract
Pre-trained Vision-Language Models (VLMs), e.g., CLIP, have become essential tools in multimodal transfer learning. However, fine-tuning VLMs in few-shot scenarios poses significant challenges in balancing task-specific adaptation and generalization in the obtained model. Meanwhile, current research has predominantly focused on prompt-based adaptation methods, leaving adapter-based approaches underexplored and revealing notable performance gaps. To address these challenges, we introduce a novel Reconstruction-based Multimodal Adapter (RMAdapter), which leverages a dual-branch architecture. Unlike conventional single-branch adapters, RMAdapter consists of: (1) an adaptation branch that injects task-specific knowledge through parameter-efficient fine-tuning, and (2) a reconstruction branch that preserves general knowledge by reconstructing latent space features back into the…
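The dual-branch idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the class name `DualBranchAdapter`, the shared down-projection, the ReLU bottleneck, the zero-initialized up-projection, and the mean-squared-error reconstruction target (the adapter's own input feature) are all assumptions made for illustration, since the abstract is truncated before it states what the latent features are reconstructed into.

```python
import numpy as np


class DualBranchAdapter:
    """Illustrative dual-branch bottleneck adapter (hypothetical, not the paper's code)."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = np.random.default_rng(seed)
        # Assumption: both branches share one down-projection into a small latent space.
        self.w_down = rng.normal(0.0, 0.02, size=(dim, bottleneck))
        # Adaptation branch: zero-initialized so the adapter starts as an identity map.
        self.w_up = np.zeros((bottleneck, dim))
        # Reconstruction branch: decodes the latent back to the feature dimension.
        self.w_rec = rng.normal(0.0, 0.02, size=(bottleneck, dim))

    def forward(self, x):
        z = np.maximum(x @ self.w_down, 0.0)   # latent features after ReLU
        adapted = x + z @ self.w_up            # residual task-specific update
        recon = z @ self.w_rec                 # reconstruction of the input (assumed target)
        recon_loss = float(np.mean((recon - x) ** 2))
        return adapted, recon_loss


# Usage with made-up shapes: 4 tokens of width 512, bottleneck of 16.
features = np.random.default_rng(1).normal(size=(4, 512))
adapter = DualBranchAdapter(dim=512, bottleneck=16)
adapted, recon_loss = adapter.forward(features)
```

In training, a reconstruction term of this kind would typically be added to the task loss with a weighting coefficient, so that the latent space is discouraged from drifting away from the pre-trained features; the weighting scheme here is likewise an assumption rather than something the abstract specifies.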
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
