MMRL++: Parameter-Efficient and Interaction-Aware Representation Learning for Vision-Language Models
Yuncheng Guo, Xiaodong Gu

TL;DR
MMRL++ is a parameter-efficient, interaction-aware method for adapting vision-language models. It uses shared, learnable representation tokens and a decoupling strategy at inference to strengthen cross-modal interactions and improve generalization, particularly in few-shot settings.
Contribution
The paper proposes MMRL++, an extension of MMRL that reduces the number of trainable parameters and improves intra-modal interactions, advancing the adaptation of vision-language models with limited data.
Findings
Outperforms state-of-the-art methods on 15 datasets
Balances task-specific adaptation and generalization effectively
Enhances intra-modal interactions through the MMRL++ extension
Abstract
Large-scale pre-trained Vision-Language Models (VLMs) have significantly advanced transfer learning across diverse tasks. However, adapting these models with limited few-shot data often leads to overfitting, undermining their ability to generalize to new tasks. To address this, we propose Multi-Modal Representation Learning (MMRL), which introduces a shared, learnable, modality-agnostic representation space. MMRL generates space tokens projected into both text and image encoders as representation tokens, enabling more effective cross-modal interactions. Unlike prior methods that mainly optimize class token features, MMRL inserts representation tokens into higher encoder layers, where task-specific features are more prominent, while preserving general knowledge in the lower layers. During training, both class and representation features are jointly optimized: a trainable projection layer…
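The token mechanism described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the sizes, the START_LAYER value, the choice of one projection per modality per layer, and all function names are assumptions made for illustration, and the weights are random rather than trained. Plain Python lists are used so the sketch runs without dependencies.

```python
import random


def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rand_matrix(rows, cols, rng, scale=0.02):
    """Small random matrix standing in for a learnable parameter."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes, chosen only for illustration.
NUM_LAYERS = 12
START_LAYER = 6  # assumption: first (1-indexed) layer that receives tokens
NUM_TOKENS = 5
SPACE_DIM = 8
IMAGE_DIM = 16
TEXT_DIM = 12

rng = random.Random(0)

# Shared, modality-agnostic space tokens (learnable in a real model).
space_tokens = rand_matrix(NUM_TOKENS, SPACE_DIM, rng)

# Assumption: one projection per modality for each higher layer.
higher_layers = range(START_LAYER, NUM_LAYERS + 1)
image_proj = {l: rand_matrix(SPACE_DIM, IMAGE_DIM, rng) for l in higher_layers}
text_proj = {l: rand_matrix(SPACE_DIM, TEXT_DIM, rng) for l in higher_layers}


def representation_tokens(layer, proj):
    """Project the shared space tokens for one layer; lower layers get none."""
    if layer not in proj:
        return []
    return matmul(space_tokens, proj[layer])


def layer_input(layer, hidden_states, proj):
    """Append this layer's representation tokens to the token sequence."""
    return hidden_states + representation_tokens(layer, proj)


# Four stand-in patch embeddings passed to a lower and a higher image layer.
patches = rand_matrix(4, IMAGE_DIM, rng)
low = layer_input(2, patches, image_proj)
high = layer_input(9, patches, image_proj)
print(len(low), len(high))
```

Because both modalities draw their tokens from the same space_tokens, a gradient update to that shared matrix would affect the image and text branches together, which is the cross-modal coupling the abstract refers to. Layers below START_LAYER see only their original inputs, matching the stated goal of leaving general knowledge in the lower layers untouched.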
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
Methods: ALIGN · Context Optimization · Contrastive Language-Image Pre-training · Balanced Selection
