OPTIMA: Optimal One-shot Pruning for LLMs via Quadratic Programming Reconstruction
Mohammad Mozaffari, Samuel Kushnir, Maryam Mehri Dehnavi, Amir Yazdanbakhsh

TL;DR
OPTIMA introduces a scalable, one-shot pruning method for large language models that uses quadratic programming to optimize weight reconstruction, significantly improving accuracy without fine-tuning.
Contribution
The paper presents OPTIMA, a novel layer-wise quadratic programming approach for one-shot pruning that balances accuracy and scalability at large model scales.
Findings
Achieves up to 3.97% accuracy improvement over existing methods.
Prunes an 8B-parameter transformer in 40 hours on a single GPU.
Sets new state-of-the-art accuracy-efficiency trade-offs for one-shot pruning.
Abstract
Post-training model pruning is a promising solution, yet it faces a trade-off: simple heuristics that zero weights are fast but degrade accuracy, while principled joint optimization methods recover accuracy but are computationally infeasible at modern scale. One-shot methods such as SparseGPT offer a practical trade-off in optimality by applying efficient, approximate heuristic weight updates. To close this gap, we introduce OPTIMA, a practical one-shot post-training pruning method that balances accuracy and scalability. OPTIMA casts layer-wise weight reconstruction after mask selection as independent, row-wise Quadratic Programs (QPs) that share a common layer Hessian. Solving these QPs yields the per-row globally optimal update with respect to the reconstruction objective given the estimated Hessian. The shared-Hessian structure makes the problem highly amenable to batching on…
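The row-wise reconstruction described in the abstract can be illustrated with a small NumPy sketch. It is not the paper's solver: it assumes the layer computes Y = X W^T (so each row of W is one output unit), that the only constraint is that pruned entries stay zero, and that a direct least-squares solve on each row's kept support is acceptable at toy scale. The names `reconstruct_rows`, `W_dense`, `mask`, and `X` are hypothetical and chosen for illustration only.

```python
import numpy as np

def reconstruct_rows(W_dense, mask, X):
    """Re-fit the kept weights of each row after a pruning mask is fixed.

    Assumed shapes (illustrative, not from the paper):
      W_dense: (out_features, in_features) dense weights
      mask:    (out_features, in_features) boolean, True = weight is kept
      X:       (n_samples, in_features) calibration activations
    Assumes the layer computes Y = X @ W.T, so rows of W decouple.
    """
    H = X.T @ X  # one Hessian, shared by every row of the layer
    W_new = np.zeros_like(W_dense)
    for i in range(W_dense.shape[0]):
        keep = np.flatnonzero(mask[i])
        if keep.size == 0:
            continue  # fully pruned row stays zero
        # Minimise (w - w0)^T H (w - w0) with w = 0 off the support.
        # Stationarity on the support: H[keep, keep] @ w_keep = H[keep, :] @ w0
        rhs = H[keep] @ W_dense[i]
        H_kk = H[np.ix_(keep, keep)]
        # lstsq tolerates a rank-deficient H_kk (e.g. few calibration samples)
        W_new[i, keep] = np.linalg.lstsq(H_kk, rhs, rcond=None)[0]
    return W_new

# Toy comparison: plain masking versus masking followed by reconstruction.
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 8))
W = rng.standard_normal((4, 8))
mask = rng.random((4, 8)) > 0.5

W_masked = W * mask
W_opt = reconstruct_rows(W, mask, X)

err_masked = np.linalg.norm(X @ W_masked.T - X @ W.T)
err_opt = np.linalg.norm(X @ W_opt.T - X @ W.T)
print(f"masked only: {err_masked:.4f}  reconstructed: {err_opt:.4f}")
```

Because the plain masked weights are themselves a feasible point of each row's problem, the reconstructed error can never exceed the masked-only error on the calibration data. Each row needs a different submatrix of the same H, which is why a per-row loop appears here; the abstract's point is that the shared H makes these problems uniform enough to batch on an accelerator.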
Peer Reviews
Decision: Submitted to ICLR 2026
- The methodology is presented clearly, and it’s nice to see that the authors also consider a practical implementation on actual hardware.
- Setting aside the fact that the method does not propose a way to find the pruning mask, the idea of further improving a pruned LLM itself seems reasonable.
- OPTIMA is not an algorithm for finding pruning masks, but rather one for updating the unpruned weights after a pruning mask has already been determined by some means. Therefore, referring to it as “one-shot pruning for LLMs” does not seem appropriate; it is more accurately described as a post-pruning weight update algorithm. While it certainly contributes to a one-shot pruning pipeline, I believe the essence of pruning lies in determining the sparse structure itself.
- A related work to menti
* One shared per-layer $H=X^TX$ turns reconstruction into **uniform QPs**, which batch naturally and saturate GPU/TPU throughput, which is a great optimization fit.
* Practical plug-in: works after common maskers, pruning pipeline doesn't need to be changed.
* Evaluated multiple models and sparsity settings, with generally consistent lifts over the underlying mask baseline.
* Reports wall time and memory, and tries to be usable, not just theory.
* Hessian inconsistency: Sometimes reads like they use (outputs) instead of (inputs). Must be consistent and show the exact activation capture point.
* Notation/shape errors: Loss decomposes by output columns, not rows. Text/algorithms say "row-wise". This is more than cosmetic and risks correctness.
* Runtime under-specified: One headline number, no mean±std, no per-layer breakdown, or solver iteration counts, and unclear dependence on calibration dataset.
* Robustness/variance weak: No seed
- The method seems to generally improve downstream task accuracy across both Llama and Gemma models.
- The method is constructed to be tractable and outperforms gradient descent methods in terms of local error minimization.
- Equation 1 is the wrong equation to use when characterizing compression problems. The goal of compression methods (both quantization and pruning) is to minimize the end-to-end error, not the immediate activation error. The immediate activation error is only used when directly considering the end-to-end error is intractable. The effect of this is pretty clear in the empirical evaluations in this paper. Figure 2 shows that OPTIMA generally does a better job of minimizing the immediate activation
Taxonomy
Topics: Advancements in Photolithography Techniques · Parallel Computing and Optimization Techniques · Advanced Neural Network Applications
