MMRL++: Parameter-Efficient and Interaction-Aware Representation Learning for Vision-Language Models

Yuncheng Guo; Xiaodong Gu

arXiv:2505.10088 · cs.CV · May 16, 2025

TL;DR

MMRL++ is a parameter-efficient, interaction-aware approach to adapting vision-language models. It uses shared, learnable representation tokens and a decoupling strategy at inference to improve cross-modal interaction and generalization, especially in few-shot settings.

Contribution

The paper proposes MMRL++, a novel extension that reduces trainable parameters and improves intra-modal interactions, advancing the adaptation of vision-language models with limited data.

Findings

01 Outperforms state-of-the-art methods on 15 datasets

02 Balances task-specific adaptation and generalization effectively

03 Enhances intra-modal interactions through the MMRL++ extension

Abstract

Large-scale pre-trained Vision-Language Models (VLMs) have significantly advanced transfer learning across diverse tasks. However, adapting these models with limited few-shot data often leads to overfitting, undermining their ability to generalize to new tasks. To address this, we propose Multi-Modal Representation Learning (MMRL), which introduces a shared, learnable, modality-agnostic representation space. MMRL generates space tokens projected into both text and image encoders as representation tokens, enabling more effective cross-modal interactions. Unlike prior methods that mainly optimize class token features, MMRL inserts representation tokens into higher encoder layers--where task-specific features are more prominent--while preserving general knowledge in the lower layers. During training, both class and representation features are jointly optimized: a trainable projection layer…
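The token mechanism described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the dimensions, the layer range, the use of a separate projection per layer and per modality, and the choice to append tokens at the end of the sequence are all assumptions made for illustration, and random matrices stand in for learnable weights.

```python
import random


def linear(in_dim, out_dim, rng):
    """Random weight matrix standing in for a learnable projection (assumption)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(out_dim)] for _ in range(in_dim)]


def project(tokens, weight):
    """Map each token (a list of floats) through the weight matrix."""
    out_dim = len(weight[0])
    return [
        [sum(tok[i] * weight[i][j] for i in range(len(tok))) for j in range(out_dim)]
        for tok in tokens
    ]


def build_representation_tokens(num_tokens=5, space_dim=8, image_dim=12,
                                text_dim=6, num_layers=12, start_layer=6, seed=0):
    """Create shared space tokens and project them for the higher layers only.

    All sizes and the start layer are hypothetical values, not taken from the paper.
    """
    rng = random.Random(seed)
    # One shared, modality-agnostic set of tokens.
    space_tokens = [[rng.gauss(0.0, 0.02) for _ in range(space_dim)]
                    for _ in range(num_tokens)]
    image_tokens, text_tokens = {}, {}
    # Layers below start_layer receive no tokens, so they are left unchanged.
    for layer in range(start_layer, num_layers + 1):
        image_tokens[layer] = project(space_tokens, linear(space_dim, image_dim, rng))
        text_tokens[layer] = project(space_tokens, linear(space_dim, text_dim, rng))
    return space_tokens, image_tokens, text_tokens


def insert_tokens(sequence, rep_tokens):
    """Append representation tokens to a layer's input sequence (placement is an assumption)."""
    return sequence + rep_tokens


space, img, txt = build_representation_tokens()
patches = [[0.0] * 12 for _ in range(3)]  # three dummy image patch embeddings
layer6_input = insert_tokens(patches, img[6])
```

Because both modalities draw their tokens from the same `space_tokens`, a gradient update to the shared space would affect the image and text branches together, which is the cross-modal coupling the abstract refers to.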

Code & Models

Repositories

yunncheng/MMRL (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition

Methods: ALIGN · Context Optimization · Contrastive Language-Image Pre-training · Balanced Selection