$\gamma$-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models
Yaxin Luo, Gen Luo, Jiayi Ji, Yiyi Zhou, Xiaoshuai Sun, Zhiqiang Shen, Rongrong Ji

TL;DR
This paper introduces $\gamma$-MoD, a novel mixture-of-depth adaptation strategy for multimodal large language models that significantly reduces computational costs while maintaining performance.
Contribution
The paper proposes ARank, a new metric for identifying redundant layers, and introduces a shared vision-language router and masked routing learning to maximize sparsity in MLLMs.
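To make the routing idea concrete, here is a minimal NumPy sketch of a mixture-of-depth layer in which one router scores visual and text tokens together. It is an illustration under stated assumptions, not the authors' implementation: the names `mod_layer`, `router_w`, `layer_fn`, and `keep_ratio` are hypothetical, the sigmoid gate is an assumed choice, and masked routing learning is not modeled.

```python
import numpy as np

def mod_layer(x, router_w, layer_fn, keep_ratio=0.5):
    """Sketch of a mixture-of-depth layer (hypothetical helper).

    x         : (tokens, dim) hidden states, visual and text tokens concatenated
    router_w  : (dim,) weights of a single router shared by both modalities
    layer_fn  : the wrapped dense layer, mapping (n, dim) -> (n, dim)
    keep_ratio: fraction of tokens that are actually processed (assumed knob)
    """
    scores = x @ router_w                           # one routing score per token
    k = max(1, int(round(keep_ratio * x.shape[0])))
    keep = np.sort(np.argsort(scores)[-k:])         # top-k tokens, original order
    out = x.copy()                                  # skipped tokens pass through
    gate = 1.0 / (1.0 + np.exp(-scores[keep]))      # assumed sigmoid gating
    out[keep] = x[keep] + gate[:, None] * layer_fn(x[keep])
    return out, keep

rng = np.random.default_rng(0)
vision = rng.normal(size=(6, 4))                    # toy visual tokens
text = rng.normal(size=(2, 4))                      # toy text tokens
x = np.concatenate([vision, text], axis=0)
router_w = rng.normal(size=4)

out, keep = mod_layer(x, router_w, lambda h: np.ones_like(h), keep_ratio=0.25)
print("processed token indices:", keep)
```

Only the tokens listed in `keep` pay for the layer computation; every other row of `out` is identical to its input, which is where the compute saving comes from.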
Findings
Over 90% of dense layers converted to MoD layers.
Reduces training time by 31% and inference time by 53%.
Maintains performance with only a 1.5% accuracy drop.
Abstract
Despite the significant progress in multimodal large language models (MLLMs), their high computational cost remains a barrier to real-world deployment. Inspired by the mixture of depths (MoDs) in natural language processing, we aim to address this limitation from the perspective of "activated tokens". Our key insight is that if most tokens are redundant for the layer computation, they can be skipped directly via the MoD layer. However, directly converting the dense layers of MLLMs to MoD layers leads to substantial performance degradation. To address this issue, we propose an innovative MoD adaptation strategy for existing MLLMs called $\gamma$-MoD. In $\gamma$-MoD, a novel metric is proposed to guide the deployment of MoDs in the MLLM, namely rank of attention maps (ARank). Through ARank, we can effectively identify which layer is redundant and should be replaced with the MoD layer.…
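A rank-of-attention-maps score of the kind described above can be sketched in a few lines of NumPy. This is a hedged illustration only: the abstract does not give the exact definition, so averaging the numerical rank over heads, the tolerance `tol`, and the reading that a lower score marks a more redundant layer are all assumptions, and the helper names `attention_map` and `arank` are hypothetical.

```python
import numpy as np

def attention_map(q, k):
    """Softmax attention map for one head; q and k have shape (tokens, dim)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def arank(attn_maps, tol=1e-3):
    """Mean numerical rank over heads (assumed aggregation).

    attn_maps: array of shape (heads, tokens, tokens)
    tol      : singular values below this count as zero (assumed threshold)
    """
    ranks = [np.linalg.matrix_rank(a, tol=tol) for a in attn_maps]
    return float(np.mean(ranks))

# Two toy "layers": one where each token attends only to itself (full rank)
# and one where every token attends uniformly to all tokens (rank 1).
layers = {
    "layer_a": np.stack([np.eye(6)] * 2),
    "layer_b": np.full((2, 6, 6), 1.0 / 6.0),
}
scores = {name: arank(maps) for name, maps in layers.items()}

# Under the stated assumption, the lowest-scoring layer is the first
# candidate for conversion to a MoD layer.
candidate = min(scores, key=scores.get)
print(scores, "->", candidate)
```

In practice the maps would come from a real model's attention heads on sample inputs; the toy matrices here only show the two extremes of the score.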
Taxonomy
Topics: Topic Modeling · Computational and Text Analysis Methods
Methods: Softmax · Attention Is All You Need
