p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay

Jun Zhang; Desen Meng; Zhengming Zhang; Zhenpeng Huang; Tao Wu; Limin Wang

arXiv:2412.04449·cs.CV·August 7, 2025

TL;DR

p-MoD is an efficient multimodal large language model (MLLM) architecture that cuts training and inference costs by applying a Mixture-of-Depths (MoD) mechanism to vision tokens: each LLM layer processes only the tokens it selects as essential and skips the redundant ones.

Contribution

The paper presents p-MoD, which integrates Mixture-of-Depths into MLLMs using progressive ratio decay together with two module designs, tanh-gated weight normalization (TanhNorm) and symmetric token reweighting (STRing), to reduce training and inference costs while maintaining performance.
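
To make "progressive ratio decay" concrete, here is a minimal sketch of a per-layer retention schedule. This page does not state the schedule's shape, so the cosine curve, the function name `retention_ratios`, and the `start` and `end` values are all assumptions made for illustration, not the authors' settings.

```python
import math


def retention_ratios(num_layers, start=1.0, end=0.1):
    """Hypothetical schedule: fraction of vision tokens kept at each layer.

    Assumes a cosine-shaped decay from `start` (first layer) to `end`
    (last layer). The shape and both endpoints are illustrative only.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    if num_layers == 1:
        return [start]
    ratios = []
    for layer in range(num_layers):
        progress = layer / (num_layers - 1)  # 0.0 at first layer, 1.0 at last
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # falls from 1 to 0
        ratios.append(end + (start - end) * cosine)
    return ratios


if __name__ == "__main__":
    for layer, ratio in enumerate(retention_ratios(8)):
        print(f"layer {layer}: keep {ratio:.2f} of vision tokens")
```

Under this assumed schedule, early layers keep nearly all vision tokens and deeper layers keep progressively fewer, which is one plausible reading of the method's name.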

Findings

1. Achieves comparable or better performance than baseline models on 15 benchmarks.
2. Reduces inference TFLOPs by 44.4% and KV cache storage by 46.3%.
3. Cuts training GPU hours by 22.3%.

Abstract

Despite the remarkable performance of multimodal large language models (MLLMs) across diverse tasks, the substantial training and inference costs impede their advancement. In this paper, we propose p-MoD, an efficient MLLM architecture that significantly reduces training and inference costs while maintaining model performance. The majority of computation in MLLMs stems from the overwhelming volume of vision tokens processed by the transformer-based LLM. Accordingly, we leverage the Mixture-of-Depths (MoD) mechanism, where each LLM layer selects essential vision tokens to process while skipping redundant ones. However, integrating MoD into MLLMs is non-trivial. To address the challenges of training and inference stability as well as limited training data, we adapt the MoD module with two novel designs: tanh-gated weight normalization (TanhNorm) and symmetric token reweighting (STRing).…
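
The selection step described in the abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the inputs are vision tokens only, the router weights are taken as given, and the gate `alpha * tanh(w)` is only one possible reading of "tanh-gated weight normalization". The names `mod_layer`, `block`, and `alpha` are hypothetical, and STRing is not modeled.

```python
import math


def mod_layer(tokens, router_weights, keep_ratio, block, alpha=0.2):
    """Sketch of one Mixture-of-Depths layer over vision tokens.

    tokens:         list of feature vectors (lists of floats)
    router_weights: one scalar per token, assumed to come from a learned router
    keep_ratio:     fraction of tokens this layer processes
    block:          callable standing in for the transformer layer
    alpha:          hypothetical gating factor for the tanh-normalized weight
    """
    n = len(tokens)
    k = max(1, math.ceil(keep_ratio * n))
    # Select the k tokens with the highest router weights.
    ranked = sorted(range(n), key=lambda i: router_weights[i], reverse=True)
    selected = set(ranked[:k])

    outputs = []
    for i, x in enumerate(tokens):
        if i in selected:
            # Assumed gate: tanh bounds the weight, alpha scales it.
            gate = alpha * math.tanh(router_weights[i])
            update = block(x)
            outputs.append([xi + gate * ui for xi, ui in zip(x, update)])
        else:
            # Skipped tokens bypass the layer unchanged.
            outputs.append(list(x))
    return outputs
```

Skipped tokens incur no layer computation, which is where the savings come from. How much is saved depends on `keep_ratio` at each layer.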

Code & Models

Repositories

mcg-nju/p-mod (PyTorch, official)

Taxonomy

Topics: Ferroelectric and Negative Capacitance Devices · Advanced Memory and Neural Computing · Advanced biosensing and bioanalysis techniques

Methods: Weight Normalization