Compressing LLMs with MoP: Mixture of Pruners
Bruno Lopes Yamamoto, Lucas Lauton de Alcantara, Victor Zacarias, Leandro Giusti Mugnaini, Keith Ando Ogawa, Lucas Pellicer, Rosimeire Pereira Costa, Edson Bollis, Anna Helena Reali Costa, Artur Jordao

TL;DR
MoP is an iterative framework that unifies depth and width pruning for LLMs. On LLaMA-2 and LLaMA-3 it exceeds the accuracy of competing structured pruning methods and cuts end-to-end latency by 39% at 40% compression.
Contribution
The paper presents MoP, an iterative framework that combines depth and width pruning in a single structured pruning procedure for large language models, instead of pruning along only one of the two dimensions.
Findings
On LLaMA-2 and LLaMA-3, MoP exceeds the accuracy of competing structured pruning methods, including depth-only and width-only pruning, across a broad set of compression regimes.
MoP achieves a 39% reduction in end-to-end latency at 40% compression.
Extending MoP to vision-language models improves computational efficiency and maintains performance after fine-tuning.
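As a quick sanity check on the latency finding above, a relative latency reduction can be converted into a speedup factor. This is a minimal sketch; the function name is illustrative and not from the paper, and it assumes the 39% reduction is measured relative to the unpruned model's end-to-end latency.

```python
def speedup_from_latency_reduction(reduction: float) -> float:
    """Convert a fractional latency reduction into a speedup factor.

    Assumption: `reduction` is relative to the baseline latency,
    so new_latency = (1 - reduction) * baseline_latency.
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


# A 39% latency reduction leaves 61% of the original latency.
print(round(speedup_from_latency_reduction(0.39), 2))  # 1.64
```

Under that assumption, the reported reduction corresponds to roughly a 1.64x end-to-end speedup.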
Abstract
The high computational demands of Large Language Models (LLMs) motivate methods that reduce parameter count and accelerate inference. In response, model pruning emerges as an effective strategy, yet current methods typically focus on a single dimension: depth or width. We introduce MoP (Mixture of Pruners), an iterative framework that unifies these dimensions. At each iteration, MoP generates two branches (pruning in depth versus pruning in width) and selects a candidate to advance the path. On LLaMA-2 and LLaMA-3, MoP advances the frontier of structured pruning, exceeding the accuracy of competing methods across a broad set of compression regimes. It also consistently outperforms depth-only and width-only pruning. Furthermore, MoP translates structural pruning into real speedup, reducing end-to-end latency by 39% at 40% compression. Finally, extending MoP to the vision-language model…
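The iterative branch-and-select loop described in the abstract can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: `ToyModel`, `prune_depth`, `prune_width`, `toy_score`, and the parameter-count formula are all hypothetical stand-ins, and the abstract does not state the criterion MoP uses to select between the two branches.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class ToyModel:
    depth: int  # number of blocks (hypothetical stand-in for layers)
    width: int  # units per block (hypothetical stand-in for heads/channels)

    @property
    def params(self) -> int:
        # Toy parameter count: each block costs width * width.
        return self.depth * self.width * self.width


def prune_depth(m: ToyModel) -> ToyModel:
    """Depth branch: remove one block."""
    return replace(m, depth=m.depth - 1)


def prune_width(m: ToyModel, step: int = 4) -> ToyModel:
    """Width branch: remove `step` units from every block."""
    return replace(m, width=m.width - step)


def toy_score(m: ToyModel) -> float:
    # Arbitrary proxy for "quality after pruning"; an assumption,
    # NOT the selection criterion used in the paper.
    return float(m.depth * m.width)


def mixture_of_pruners(
    model: ToyModel,
    score: Callable[[ToyModel], float],
    target_compression: float,
) -> Tuple[ToyModel, List[str]]:
    """Iteratively build both branches and keep the higher-scoring one."""
    original = model.params
    path: List[str] = []
    while 1 - model.params / original < target_compression:
        candidates = {}
        if model.depth > 1:
            candidates["depth"] = prune_depth(model)
        if model.width > 4:
            candidates["width"] = prune_width(model)
        if not candidates:
            break  # nothing left to prune
        choice = max(candidates, key=lambda k: score(candidates[k]))
        model = candidates[choice]
        path.append(choice)
    return model, path


pruned, path = mixture_of_pruners(ToyModel(depth=32, width=128), toy_score, 0.40)
print(pruned, f"compression={1 - pruned.params / (32 * 128 * 128):.2%}")
print(path)
```

The returned `path` records which branch won at each iteration. In this sketch a depth-only or width-only pruner is the special case where the same branch wins every time.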
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
