Cross-Modal Adapter: Parameter-Efficient Transfer Learning Approach for Vision-Language Models

Juncheng Yang; Zuchao Li; Shuai Xie; Weiping Zhu; Wei Yu; Shijun Li

arXiv:2404.12588·cs.CV·April 22, 2024

TL;DR

This paper introduces XMAdapter, a cross-modal adapter for vision-language models that improves transfer-learning efficiency through bimodal (image and text) retrieval and dynamic affinity adjustment, outperforming previous adapter methods in accuracy and efficiency.

Contribution

The work presents a novel cross-modal adapter that incorporates bimodal retrieval, dynamic affinity tuning, and adaptive learning to improve transfer learning in vision-language models.

Findings

01. Outperforms previous adapter methods in accuracy and efficiency
02. Effectively leverages bimodal retrieval for cross-modal fusion
03. Enhances generalization through adaptive sample learning

Abstract

Adapter-based parameter-efficient transfer learning has achieved exciting results in vision-language models. Traditional adapter methods often require training or fine-tuning, facing challenges such as insufficient samples or resource limitations. While some methods overcome the need for training by leveraging image modality cache and retrieval, they overlook the text modality's importance and cross-modal cues for the efficient adaptation of parameters in visual-language models. This work introduces a cross-modal parameter-efficient approach named XMAdapter. XMAdapter establishes cache models for both text and image modalities. It then leverages retrieval through visual-language bimodal information to gather clues for inference. By dynamically adjusting the affinity ratio, it achieves cross-modal fusion, decoupling different modal similarities to assess their respective contributions.…
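The abstract's core idea (one cache per modality, retrieval against both, and a ratio that balances the two affinities) can be illustrated with a small sketch. This is not the paper's implementation: the exponential affinity follows the general style of image-cache adapters such as Tip-Adapter, and the names `fused_logits`, `ratio`, and `beta`, the toy features, and the fixed (rather than dynamically adjusted) ratio are all assumptions made for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def affinities(query, keys, beta):
    """Affinity of the query to each cached key: exp(-beta * (1 - cosine))."""
    return [math.exp(-beta * (1.0 - dot(query, k))) for k in keys]


def fused_logits(query, image_keys, text_keys, labels, num_classes,
                 ratio=0.5, beta=5.5):
    """Class scores from two caches, mixed by a hypothetical fixed ratio.

    ratio=1.0 uses only the image cache, ratio=0.0 only the text cache.
    The paper adjusts this balance dynamically; here it is a plain
    hyperparameter (an assumption for the sake of the sketch).
    """
    img_aff = affinities(query, image_keys, beta)
    txt_aff = affinities(query, text_keys, beta)
    logits = [0.0] * num_classes
    for a_img, a_txt, label in zip(img_aff, txt_aff, labels):
        # Each cached entry votes for its label, weighted by the fused affinity.
        logits[label] += ratio * a_img + (1.0 - ratio) * a_txt
    return logits


# Toy caches (made-up 2-D features): entry i has an image key, a text key,
# and a class label.
image_keys = [normalize([1.0, 0.1]), normalize([0.1, 1.0])]
text_keys = [normalize([1.0, 0.0]), normalize([0.0, 1.0])]
labels = [0, 1]

query = normalize([0.9, 0.2])
logits = fused_logits(query, image_keys, text_keys, labels, num_classes=2)
pred = max(range(len(logits)), key=lambda c: logits[c])
print(pred, logits)
```

Keeping the two affinity lists separate until the final mixing step is what lets each modality's contribution be inspected on its own, which is the decoupling the abstract refers to. In this sketch that is done by varying `ratio` by hand.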

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Adapter