M2IST: Multi-Modal Interactive Side-Tuning for Efficient Referring Expression Comprehension

Xuyang Liu; Ting Liu; Siteng Huang; Yi Xin; Yue Hu; Quanjun Yin; Donglin Wang; Yuanyuan Wu; Honggang Chen

arXiv:2407.01131 · cs.CV · March 14, 2025

Open Access

TL;DR

M2IST introduces a parameter-efficient method for referring expression comprehension that enhances multi-modal interaction while significantly reducing computational costs compared to full fine-tuning.

Contribution

The paper proposes M2IST, a multi-modal side-tuning approach built on M3ISAs (Mixture of Multi-Modal Interactive Side-Adapters), which aligns vision and language features efficiently while the pre-trained encoders stay frozen.

Findings

1. Balances accuracy and efficiency better than full fine-tuning and other PETL methods.

2. Tunes only 2.11% as many parameters as full fine-tuning, which reduces GPU memory use and training time.

3. Maintains competitive accuracy on referring expression comprehension.

Abstract

Referring expression comprehension (REC) is a vision-language task to locate a target object in an image based on a language expression. Fully fine-tuning general-purpose pre-trained vision-language foundation models for REC yields impressive performance but becomes increasingly costly. Parameter-efficient transfer learning (PETL) methods have shown strong performance with fewer tunable parameters. However, directly applying PETL to REC faces two challenges: (1) insufficient multi-modal interaction between pre-trained vision-language foundation models, and (2) high GPU memory usage due to gradients passing through the heavy vision-language foundation models. To this end, we present M2IST: Multi-Modal Interactive Side-Tuning with M3ISAs: Mixture of Multi-Modal Interactive Side-Adapters. During fine-tuning, we fix the pre-trained uni-modal encoders and update M3ISAs to enable efficient…
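The memory argument in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a made-up scalar "backbone" (`BACKBONE`), a linear side network (`side`), and toy regression data, all hypothetical names chosen for illustration. It also leaves out the cross-modal interaction that M3ISAs provide. The only point it shows is that the frozen backbone runs forward-only, its features are treated as constants, and gradients are computed for the side parameters alone.

```python
import math

# Hypothetical frozen backbone: fixed (weight, bias) pairs, stored as tuples
# so nothing in this script can update them.
BACKBONE = ((1.5, -0.2), (0.8, 0.1), (1.2, -0.3))


def backbone_features(x):
    """Forward-only pass that returns the output of every frozen layer."""
    feats, h = [], x
    for w, b in BACKBONE:
        h = math.tanh(w * h + b)
        feats.append(h)
    return feats


def side_predict(feats, side):
    """Side network: weighted sum of backbone features plus a bias (side[-1])."""
    return sum(a * f for a, f in zip(side, feats)) + side[-1]


def train_side(data, steps=200, lr=0.1):
    """Fit only the side parameters by gradient descent on squared error."""
    side = [0.0] * (len(BACKBONE) + 1)
    # Backbone outputs are constants during training, so compute them once.
    cached = [(backbone_features(x), y) for x, y in data]
    losses = []
    for _ in range(steps):
        grad = [0.0] * len(side)
        loss = 0.0
        for feats, y in cached:
            err = side_predict(feats, side) - y
            loss += err * err
            for i, f in enumerate(feats):
                grad[i] += 2.0 * err * f  # d(err^2)/d(side[i])
            grad[-1] += 2.0 * err         # d(err^2)/d(bias)
        n = len(cached)
        side = [s - lr * g / n for s, g in zip(side, grad)]
        losses.append(loss / n)
    return side, losses


# Toy regression data, made up for illustration.
data = [(x / 4.0, 0.5 * (x / 4.0) + 1.0) for x in range(-4, 5)]
side, losses = train_side(data)
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

Because no gradient is ever taken with respect to the backbone, its activations do not need to be kept for a backward pass. In an autograd framework the same effect is usually obtained by running the backbone without gradient tracking, or by detaching its outputs before they enter the side network.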

Taxonomy

Topics: Speech and dialogue systems · Topic Modeling · Natural Language Processing Techniques