Deterministic Differentiable Structured Pruning for Large Language Models
Weiyu Huang, Pengle Zhang, Xiaolu Zhang, Jun Zhou, Jun Zhu, Jianfei Chen

TL;DR
This paper introduces Deterministic Differentiable Pruning (DDP), a novel method for structured pruning of large language models that improves efficiency and reduces train-test mismatch by directly optimizing a deterministic surrogate of the l0 norm.
Contribution
The paper proposes DDP, a deterministic mask optimization approach for structured pruning, outperforming stochastic methods in efficiency and accuracy on large language models.
Findings
Limits performance loss to as little as 1% at 20% sparsity.
Outperforms previous methods in structured pruning.
Demonstrates end-to-end inference speedups in deployment.
Abstract
Structured pruning reduces LLM inference cost by removing low-importance architectural components. This can be viewed as learning a multiplicative gate for each component under an l0 sparsity constraint. Due to the discreteness of the l0 norm, prior work typically adopts stochastic hard-concrete relaxations to enable differentiable optimization; however, this stochasticity can introduce a train-test mismatch when sampled masks are discretized for deployment and restricts masks to a bounded, near-binary range. To address this, we propose Deterministic Differentiable Pruning (DDP), a mask-only optimization method that eliminates stochasticity by directly optimizing a deterministic soft surrogate of the discrete l0 objective. Compared with prior approaches, DDP offers greater expressiveness, reduced train-test mismatch, and faster convergence. We apply our method to several dense and MoE…
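To make the train-test mismatch mentioned in the abstract concrete, the following is a minimal sketch of the stochastic hard-concrete gate used by prior work (the relaxation of Louizos et al., 2018). It is not the DDP surrogate, whose exact form is not given in this summary. The constants GAMMA, ZETA, and BETA are commonly used defaults, and the helper names (sample_gate, expected_l0, deploy_mask, gated_sum) and the toy inputs are assumptions made for illustration only.

```python
import math
import random

# Stretch limits and temperature: commonly used hard-concrete defaults (assumed here).
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_gate(log_alpha, rng):
    """Draw one stochastic hard-concrete gate in [0, 1] for a component."""
    u = rng.uniform(1e-6, 1.0 - 1e-6)
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / BETA)
    # Stretch to (GAMMA, ZETA), then clamp to the bounded range [0, 1].
    return min(1.0, max(0.0, s * (ZETA - GAMMA) + GAMMA))


def expected_l0(log_alphas):
    """Differentiable surrogate: expected number of nonzero gates."""
    shift = BETA * math.log(-GAMMA / ZETA)
    return sum(sigmoid(a - shift) for a in log_alphas)


def deploy_mask(log_alphas, keep):
    """Discretize for deployment: keep the `keep` highest-scoring components."""
    ranked = sorted(range(len(log_alphas)), key=lambda i: log_alphas[i], reverse=True)
    kept = set(ranked[:keep])
    return [1.0 if i in kept else 0.0 for i in range(len(log_alphas))]


def gated_sum(outputs, gates):
    """Combine component outputs, each scaled by its multiplicative gate."""
    return sum(o * g for o, g in zip(outputs, gates))


if __name__ == "__main__":
    rng = random.Random(0)
    log_alphas = [2.0, 0.5, -0.5, -2.0]   # hypothetical learned gate logits
    outputs = [1.0, 1.0, 1.0, 1.0]        # hypothetical per-component outputs
    train_out = gated_sum(outputs, [sample_gate(a, rng) for a in log_alphas])
    test_out = gated_sum(outputs, deploy_mask(log_alphas, keep=2))
    print("sampled (training) output:", train_out)
    print("discretized (deployment) output:", test_out)
    print("expected l0:", expected_l0(log_alphas))
```

During training the model sees sampled, mostly near-binary gates, while deployment uses a fixed binary mask, so the two outputs generally differ. A deterministic surrogate, as proposed in the paper, aims to remove the sampling step that contributes to this gap.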
