RMAdapter: Reconstruction-based Multi-Modal Adapter for Vision-Language Models

Xiang Lin; Weixin Li; Shu Guo; Lihong Wang; Di Huang

arXiv:2512.06811·cs.CV·December 9, 2025

TL;DR

RMAdapter is a dual-branch, reconstruction-based adapter for fine-tuning vision-language models. It balances task-specific adaptation against the preservation of general knowledge, which improves performance across several generalization tasks.

Contribution

The paper proposes a novel lightweight dual-branch RMAdapter that combines adaptation and reconstruction to enhance fine-tuning of VLMs in few-shot scenarios.

Findings

1. Outperforms state-of-the-art methods on multiple generalization tasks

2. Effectively balances task-specific knowledge and general knowledge

3. Maintains low computational overhead despite the additional reconstruction branch

Abstract

Pre-trained Vision-Language Models (VLMs), e.g., CLIP, have become essential tools in multimodal transfer learning. However, fine-tuning VLMs in few-shot scenarios poses significant challenges in balancing task-specific adaptation and generalization in the resulting model. Meanwhile, current research has predominantly focused on prompt-based adaptation methods, leaving adapter-based approaches underexplored and revealing notable performance gaps. To address these challenges, we introduce a novel Reconstruction-based Multimodal Adapter (RMAdapter), which leverages a dual-branch architecture. Unlike conventional single-branch adapters, RMAdapter consists of: (1) an adaptation branch that injects task-specific knowledge through parameter-efficient fine-tuning, and (2) a reconstruction branch that preserves general knowledge by reconstructing latent space features back into the…
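
The dual-branch idea described in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the class name `DualBranchAdapter`, the shared down-projection, the zero-initialized adaptation branch, and the choice of the adapter's own input as the reconstruction target are all assumptions made for illustration, since the abstract is truncated before it states the reconstruction target.

```python
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def init_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix; the init scheme is an illustrative assumption."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class DualBranchAdapter:
    """Hypothetical dual-branch adapter (a sketch, not the paper's code)."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        # Down-projection into a low-dimensional latent space.
        # Sharing it between both branches is an assumption.
        self.down = init_matrix(bottleneck, dim, rng)
        # Adaptation branch: zero-initialized up-projection, so the
        # adapter starts out as an identity mapping (assumption).
        self.up_adapt = [[0.0] * bottleneck for _ in range(dim)]
        # Reconstruction branch: maps latent features back to `dim`.
        self.up_recon = init_matrix(dim, bottleneck, rng)

    def forward(self, x):
        z = relu(matvec(self.down, x))  # latent features
        # Adaptation branch: residual update of the frozen feature x.
        delta = matvec(self.up_adapt, z)
        adapted = [a + d for a, d in zip(x, delta)]
        # Reconstruction branch: mean squared error between the
        # reconstruction and x (the target is assumed here).
        recon = matvec(self.up_recon, z)
        recon_loss = sum((r - a) ** 2 for r, a in zip(recon, x)) / len(x)
        return adapted, recon_loss


adapter = DualBranchAdapter(dim=4, bottleneck=2)
features = [1.0, -2.0, 0.5, 3.0]
adapted, recon_loss = adapter.forward(features)
```

In a training loop, a sketch like this would presumably add `recon_loss`, scaled by some weight, to the task loss, so that the latent space stays informative about the original features while the adaptation branch learns the task. How the two terms are actually combined is not stated in the text above.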

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling