$\gamma$-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models

Yaxin Luo; Gen Luo; Jiayi Ji; Yiyi Zhou; Xiaoshuai Sun; Zhiqiang Shen; Rongrong Ji

arXiv:2410.13859·cs.CV·October 18, 2024

TL;DR

This paper introduces $\gamma$-MoD, a novel mixture-of-depth adaptation strategy for multimodal large language models that significantly reduces computational costs while maintaining performance.

Contribution

The paper proposes ARank, a new metric for identifying redundant layers, and introduces a shared vision-language router and masked routing learning to maximize sparsity in MLLMs.
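The token-skipping idea behind a mixture-of-depth layer with one router shared across modalities can be illustrated with a minimal sketch. It is not the authors' implementation: the function `route_tokens`, the linear scoring rule, the `keep_ratio` parameter, and the plain residual update are all assumptions made for illustration, and masked routing learning is not modeled here.

```python
from typing import Callable, List

Vector = List[float]


def route_tokens(
    tokens: List[Vector],
    router_weights: Vector,
    layer_fn: Callable[[Vector], Vector],
    keep_ratio: float,
) -> List[Vector]:
    """Apply layer_fn only to the highest-scoring tokens; the rest skip the layer.

    Hypothetical sketch: a single linear router scores every token, whether it
    came from the vision encoder or the text embedding (the "shared" part).
    """
    # Router score = dot product between the token and the router weights.
    scores = [sum(w * x for w, x in zip(router_weights, tok)) for tok in tokens]

    # Number of tokens that are actually processed by the layer.
    k = max(1, round(keep_ratio * len(tokens)))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])

    out: List[Vector] = []
    for i, tok in enumerate(tokens):
        if i in keep:
            update = layer_fn(tok)
            # Residual connection: processed tokens get x + f(x).
            out.append([x + u for x, u in zip(tok, update)])
        else:
            # Skipped tokens pass through unchanged, costing no layer compute.
            out.append(list(tok))
    return out


# Toy usage: four 2-d tokens, half of them routed through a stand-in "layer".
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 0.5]]
routed = route_tokens(
    tokens,
    router_weights=[1.0, 0.0],
    layer_fn=lambda t: [0.5 * x for x in t],
    keep_ratio=0.5,
)
print(routed)
```

With `keep_ratio=0.5`, only the two tokens with the largest router scores are updated, so the layer does roughly half the work on this toy input.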

Findings

01. Over 90% of dense layers converted to MoD layers.

02. Reduces training time by 31% and inference time by 53%.

03. Maintains performance with only a 1.5% accuracy drop.

Abstract

Despite the significant progress in multimodal large language models (MLLMs), their high computational cost remains a barrier to real-world deployment. Inspired by the mixture of depths (MoDs) in natural language processing, we aim to address this limitation from the perspective of "activated tokens". Our key insight is that if most tokens are redundant for the layer computation, they can be skipped directly via the MoD layer. However, directly converting the dense layers of MLLMs to MoD layers leads to substantial performance degradation. To address this issue, we propose an innovative MoD adaptation strategy for existing MLLMs called $\gamma$-MoD. In $\gamma$-MoD, a novel metric is proposed to guide the deployment of MoDs in the MLLM, namely the rank of attention maps (ARank). Through ARank, we can effectively identify which layer is redundant and should be replaced with the MoD layer.…
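As a rough illustration of what a rank-of-attention-maps criterion might look like, the sketch below computes the numerical rank of one attention matrix per layer and flags low-rank layers as conversion candidates. This excerpt does not give the exact ARank definition, so several things here are assumptions: that a lower rank signals redundancy, the threshold `max_rank`, the tolerance `tol`, and the helper names `numerical_rank` and `low_rank_layers`. Gaussian elimination is used only to keep the example dependency-free; an SVD-based routine such as `numpy.linalg.matrix_rank` would be the usual choice in practice.

```python
from typing import List

Matrix = List[List[float]]


def numerical_rank(matrix: Matrix, tol: float = 1e-6) -> int:
    """Rank of a matrix via Gaussian elimination with partial pivoting."""
    rows = [list(r) for r in matrix]  # work on a copy
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Pick the remaining row with the largest entry in this column.
        pivot = max(range(rank, n_rows), key=lambda r: abs(rows[r][col]))
        if abs(rows[pivot][col]) <= tol:
            continue  # column is (numerically) dependent on earlier ones
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        # Eliminate this column from every row below the pivot row.
        for r in range(rank + 1, n_rows):
            factor = rows[r][col] / rows[rank][col]
            for c in range(col, n_cols):
                rows[r][c] -= factor * rows[rank][c]
        rank += 1
    return rank


def low_rank_layers(attention_maps: List[Matrix], max_rank: int) -> List[int]:
    """Indices of layers whose attention-map rank is at most max_rank.

    Assumption for illustration: such layers are the candidates for
    replacement with MoD layers.
    """
    return [i for i, a in enumerate(attention_maps) if numerical_rank(a) <= max_rank]


# Toy usage: layer 0 attends uniformly (rank 1), layer 1 attends one-to-one (rank 3).
uniform = [[1 / 3] * 3 for _ in range(3)]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
print(low_rank_layers([uniform, identity], max_rank=1))
```

In the toy input, the uniform attention map carries no token-specific information and collapses to rank 1, while the one-to-one map is full rank, so only the first layer is flagged.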

Code & Models

Models

🤗 YaxinLuo/Gamma-MoD-llava-hr-13b-0.34 (model)

Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods

Methods: Softmax · Attention Is All You Need