Compressing LLMs with MoP: Mixture of Pruners

Bruno Lopes Yamamoto; Lucas Lauton de Alcantara; Victor Zacarias; Leandro Giusti Mugnaini; Keith Ando Ogawa; Lucas Pellicer; Rosimeire Pereira Costa; Edson Bollis; Anna Helena Reali Costa; Artur Jordao

arXiv:2602.06127·cs.LG·February 9, 2026

Open Access

TL;DR

MoP is an iterative framework that unifies depth and width pruning for LLMs. On LLaMA-2 and LLaMA-3 it exceeds the accuracy of competing structured pruning methods, and it reduces end-to-end latency by 39% at 40% compression.

Contribution

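The paper presents MoP, an iterative framework for structured pruning of large language models. At each iteration it generates a depth-pruned branch and a width-pruned branch and selects one candidate to continue from, unifying two dimensions that existing methods typically treat separately.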

Findings

01. On LLaMA-2 and LLaMA-3, MoP exceeds the accuracy of competing structured pruning methods across a broad set of compression regimes, and it consistently outperforms depth-only and width-only pruning.

02. MoP reduces end-to-end latency by 39% at 40% compression.

03. Extending MoP to vision-language models improves computational efficiency and maintains performance after fine-tuning.

Abstract

The high computational demands of Large Language Models (LLMs) motivate methods that reduce parameter count and accelerate inference. In response, model pruning emerges as an effective strategy, yet current methods typically focus on a single dimension: depth or width. We introduce MoP (Mixture of Pruners), an iterative framework that unifies these dimensions. At each iteration, MoP generates two branches (pruning in depth versus pruning in width) and selects a candidate to advance the path. On LLaMA-2 and LLaMA-3, MoP advances the frontier of structured pruning, exceeding the accuracy of competing methods across a broad set of compression regimes. It also consistently outperforms depth-only and width-only pruning. Furthermore, MoP translates structural pruning into real speedup, reducing end-to-end latency by 39% at 40% compression. Finally, extending MoP to the vision-language model…
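The two-branch loop described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the abstract does not state how the candidate is selected, so the sketch assumes a greedy rule that keeps whichever branch scores higher under a user-supplied function. `ToyModel`, `prune_depth`, `prune_width`, `toy_score`, and `mixture_of_pruners` are all hypothetical names introduced only for illustration, and the parameter count `depth * width**2` is a crude stand-in for a real network.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class ToyModel:
    """Hypothetical stand-in for a network: `depth` blocks of hidden size `width`."""
    depth: int
    width: int

    def num_params(self) -> int:
        # Crude proxy: each block holds one width x width weight matrix.
        return self.depth * self.width * self.width


def prune_depth(m: ToyModel) -> ToyModel:
    # Depth branch: remove one whole block.
    return ToyModel(m.depth - 1, m.width)


def prune_width(m: ToyModel) -> ToyModel:
    # Width branch: remove roughly one eighth of the hidden units (at least one).
    return ToyModel(m.depth, m.width - max(1, m.width // 8))


def toy_score(m: ToyModel) -> float:
    # ASSUMPTION: a made-up quality proxy. A real setup would evaluate the
    # pruned model on held-out data instead.
    return float(m.depth * m.width)


Pruner = Tuple[str, Callable[[ToyModel], ToyModel]]


def mixture_of_pruners(
    model: ToyModel,
    target_ratio: float,
    score: Callable[[ToyModel], float],
    pruners: List[Pruner],
) -> Tuple[ToyModel, List[str]]:
    """Iteratively prune until `target_ratio` of the parameters is removed."""
    budget = (1.0 - target_ratio) * model.num_params()
    path: List[str] = []
    while model.num_params() > budget:
        # Generate one candidate per branch from the current model.
        candidates = [(name, fn(model)) for name, fn in pruners]
        # Discard degenerate candidates (no blocks or no hidden units left).
        candidates = [(n, c) for n, c in candidates if c.depth >= 1 and c.width >= 1]
        if not candidates:
            break
        # ASSUMPTION: greedy selection, keep the highest-scoring branch.
        name, model = max(candidates, key=lambda nc: score(nc[1]))
        path.append(name)
    return model, path


start = ToyModel(depth=8, width=64)
final, path = mixture_of_pruners(
    start, 0.4, toy_score, [("depth", prune_depth), ("width", prune_width)]
)
print(final, path)
```

Because both branches are regenerated from the current model at every step, the resulting path can interleave depth and width steps, which is the property that distinguishes this scheme from depth-only or width-only pruning.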

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning