Matryoshka Multimodal Models
Mu Cai, Jianwei Yang, Jianfeng Gao, Yong Jae Lee

TL;DR
M3 introduces nested visual token representations in multimodal models, enabling flexible control over visual detail and improving efficiency without sacrificing accuracy, especially in dense visual scenarios.
Contribution
The paper proposes a novel nested token framework for multimodal models, allowing adjustable visual granularity and detailed analysis of dataset requirements and performance trade-offs.
Findings
COCO-style benchmarks need only ~9 tokens for similar accuracy
M3 enables explicit control of visual granularity during inference
Significant gap exists between oracle upper bound and fixed-scale representations
Abstract
Large Multimodal Models (LMMs) such as LLaVA have shown strong performance in visual-linguistic reasoning. These models first embed images into a fixed large number of visual tokens and then feed them into a Large Language Model (LLM). However, this design causes an excessive number of tokens for dense visual scenarios such as high-resolution images and videos, leading to great inefficiency. While token pruning/merging methods do exist, they produce a single-length output for each image and do not afford flexibility in trading off information density vs. efficiency. Inspired by the concept of Matryoshka Dolls, we propose M3: Matryoshka Multimodal Models, which learns to represent visual content as nested sets of visual tokens that capture information across multiple coarse-to-fine granularities. Our approach offers several unique benefits for LMMs: (1) One can explicitly control the…
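The nested coarse-to-fine token sets described in the abstract can be illustrated with a small sketch. It assumes the nested sets are built by average pooling a square grid of visual tokens; the 24×24 grid, the scale sizes, and the helper names `avg_pool_grid` and `nested_token_sets` are illustrative assumptions, not details stated in this excerpt, and the toy features stand in for real encoder outputs.

```python
# Minimal sketch, NOT the authors' implementation.
# Assumptions: tokens lie on a 24x24 grid and coarser sets come from
# average pooling. All names here are hypothetical.

def avg_pool_grid(grid, out_size):
    """Average-pool a square grid of feature vectors to out_size x out_size."""
    in_size = len(grid)
    assert in_size % out_size == 0, "out_size must divide the grid size"
    k = in_size // out_size          # side length of each pooling window
    dim = len(grid[0][0])
    pooled = []
    for r in range(out_size):
        row = []
        for c in range(out_size):
            acc = [0.0] * dim
            for dr in range(k):
                for dc in range(k):
                    vec = grid[r * k + dr][c * k + dc]
                    for i in range(dim):
                        acc[i] += vec[i]
            row.append([a / (k * k) for a in acc])
        pooled.append(row)
    return pooled


def nested_token_sets(grid, sizes=(24, 12, 6, 3, 1)):
    """Map token count -> flattened token list for each assumed scale."""
    return {
        s * s: [vec for row in avg_pool_grid(grid, s) for vec in row]
        for s in sizes
    }


# Toy 24x24 grid of 3-dimensional features (row index, column index, 1.0).
grid = [[[float(r), float(c), 1.0] for c in range(24)] for r in range(24)]
scales = nested_token_sets(grid)

for n_tokens in sorted(scales, reverse=True):
    print(n_tokens, "tokens")
```

Because every coarser set is an average of equal-sized blocks of the finer one, each scale can be recomputed from the scale above it, which is the nesting property the Matryoshka analogy refers to. Under these assumptions, choosing a token budget at inference time amounts to picking one key of `scales`.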
Peer Reviews
Decision·ICLR 2025 Poster
* The paper tackles a meaningful problem in practice. The motivation is coherent and easy to follow.
* The method description is mostly clear.
* Extensive experiments with strong results and insightful analysis.
* **Comparison with Flamingo-style (i.e., cross-attention-based) methods**: Despite the popularity of LLaVA-style MLLMs, which treat visual tokens as a prompt, Flamingo-style MLLMs, which decode text-conditioned salient visual features with cross-attention modules, are also studied as an alternative paradigm in several previous works, e.g., [1, 2]. It's noteworthy that cross-attention alleviates most of the performance penalty due to long visual sequences by nature, because the visual tokens do
- The motivation is clear. Current LMMs need more and more visual tokens to enhance their performance, so the study of token reduction is important for efficient LMMs.
- The method is simple and easy to implement. Instead of tuning the LLM to accept a varying number of tokens, M3 shows that tuning CLIP also works.
- The main evaluation and ablation analysis confirm M3's effectiveness.
- Comparisons with dynamic sampling methods like Token Merging and Chat-UniVi [1]. The performance drop is significant when reducing the number of tokens, while dynamic sampling methods like Chat-UniVi can even surpass their full-token baselines. Besides, M3 can be regarded as a special case of dynamic sampling. I suggest a fair comparison with these methods.
- High-resolution and long video evaluation and comparisons with other works (LLaVA-HD, SPHINX, LLaMA-VID, etc.). Since these tasks usually
- This paper is well-motivated, addressing the important capability of representing visual information at varying levels of granularity. This flexibility enables adjusting the number of visual tokens based on both computational budget and task complexity.
- The proposed method is effective, demonstrating comparable performance with the baseline LMMs (LLaVA-1.5 and LLaVA-Next) while using significantly fewer visual tokens, across benchmarks that do not demand dense visual perception.
- The empiri
- Although M3 can produce visual representations at multiple granularity levels, the number of visual tokens used at inference must be predefined. In other words, the method cannot adaptively adjust the number of visual tokens for different instances.
- The baseline methods used for video understanding are relatively weak. For example, recent 7B-scale VLMs have achieved over 60% accuracy on EgoSchema, while the best baseline in this work only reaches 35.8%. M3 would likely benefit from integrati
Taxonomy
Topics: Advanced Research in Systems and Signal Processing
