p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay
Jun Zhang, Desen Meng, Zhengming Zhang, Zhenpeng Huang, Tao Wu, Limin Wang

TL;DR
p-MoD is an efficient multimodal large language model architecture that cuts computational cost by having each LLM layer process only a selected subset of vision tokens through a Mixture-of-Depths mechanism, adapted with tanh-gated weight normalization (TanhNorm) and symmetric token reweighting (STRing).
Contribution
The paper presents p-MoD, an architecture that integrates Mixture-of-Depths with progressive ratio decay, tanh-gated weight normalization (TanhNorm), and symmetric token reweighting (STRing) to reduce training and inference costs while maintaining model performance.
Findings
Achieves comparable or better performance than baseline models on 15 benchmarks.
Reduces inference TFLOPs by 44.4% and KV cache storage by 46.3%.
Cuts training GPU hours by 22.3%.
Abstract
Despite the remarkable performance of multimodal large language models (MLLMs) across diverse tasks, the substantial training and inference costs impede their advancement. In this paper, we propose p-MoD, an efficient MLLM architecture that significantly reduces training and inference costs while maintaining model performance. The majority of computation in MLLMs stems from the overwhelming volume of vision tokens processed by the transformer-based LLM. Accordingly, we leverage the Mixture-of-Depths (MoD) mechanism, where each LLM layer selects essential vision tokens to process while skipping redundant ones. However, integrating MoD into MLLMs is non-trivial. To address the challenges of training and inference stability as well as limited training data, we adapt the MoD module with two novel designs: tanh-gated weight normalization (TanhNorm) and symmetric token reweighting (STRing).…
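The token-selection idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the router (a single dot product), the cosine-shaped retention schedule, the gating factor `alpha`, and the names `retention_ratio` and `mod_layer` are all assumptions made for illustration. The sketch treats every input token as a vision token, applies a per-token stand-in function instead of a real transformer layer, and does not model STRing.

```python
import math
import random


def retention_ratio(layer_idx, num_layers, start=1.0, end=0.1):
    """Hypothetical cosine schedule: fraction of tokens kept at a given layer.

    Decays smoothly from `start` at the first layer to `end` at the last.
    """
    progress = layer_idx / max(num_layers - 1, 1)
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * progress))


def mod_layer(tokens, router_w, layer_fn, ratio, alpha=0.2):
    """Apply layer_fn only to the top-scoring tokens; the rest skip the layer."""
    # Router score per token: an assumed linear projection to a scalar.
    scores = [sum(w * x for w, x in zip(router_w, tok)) for tok in tokens]
    k = max(1, math.ceil(ratio * len(tokens)))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    out = []
    for i, tok in enumerate(tokens):
        if i in keep:
            # Tanh-gated weight: bounded in (-alpha, alpha), so the scaled
            # update stays small regardless of the raw router score.
            gate = alpha * math.tanh(scores[i])
            update = layer_fn(tok)
            out.append([x + gate * u for x, u in zip(tok, update)])
        else:
            # Skipped token passes through the residual path unchanged.
            out.append(list(tok))
    return out, sorted(keep)


# Toy usage: 8 tokens of dimension 4, keep half of them at this layer.
random.seed(0)
tokens = [[random.gauss(0, 1) for _ in range(4)] for _ in range(8)]
router_w = [random.gauss(0, 1) for _ in range(4)]
out, kept = mod_layer(tokens, router_w, lambda t: [2.0 * x for x in t], ratio=0.5)
```

In this toy setup, lowering `ratio` in deeper layers (for example via `retention_ratio`) shrinks the number of tokens each layer processes, which is where the compute and KV cache savings would come from.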
Taxonomy
Topics: Ferroelectric and Negative Capacitance Devices · Advanced Memory and Neural Computing · Advanced biosensing and bioanalysis techniques
Methods: Weight Normalization
