SparseSwaps: Tractable LLM Pruning Mask Refinement at Scale
Max Zimmer, Christophe Roux, Moritz Wagner, Deborah Hendrych, Sebastian Pokutta

TL;DR
SparseSwaps introduces a scalable, efficient method for refining pruning masks in large language models, significantly reducing error and improving performance without extensive retraining.
Contribution
It proposes a novel 1-swap algorithm that simplifies mask refinement at LLM scale, enabling efficient, hyperparameter-free pruning mask optimization.
Findings
Reduces per-layer pruning error by up to 60% over previous methods.
Improves perplexity and zero-shot accuracy across GPT models.
Runs efficiently on GPUs at large scale.
Abstract
The resource requirements of neural networks can be significantly reduced through pruning - the removal of seemingly less important parameters. However, for LLMs, full retraining to recover pruning-induced performance degradation is often prohibitive and classical approaches such as magnitude pruning are suboptimal on Transformers. State-of-the-art methods hence solve a layer-wise mask selection problem: finding a pruning mask that minimizes per-layer pruning error on a small set of calibration data. Exactly solving this problem is computationally infeasible due to its combinatorial nature and the size of the search space, and existing approaches rely on approximations or heuristics. We demonstrate that the mask selection problem can be made drastically more tractable at LLM scale. To that end, we decouple the rows by enforcing equal sparsity levels per row. This allows us to derive…
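To make the layer-wise objective and the effect of row decoupling concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names (`layer_error`, `row_error`, `best_one_swap`, `refine_row`), the Gram-matrix formulation, and the greedy stopping rule are assumptions chosen for illustration, and the sketch omits any compression or incremental-update tricks the paper may use. It assumes a weight matrix `W` of shape (d_out, d_in), calibration inputs `X` of shape (d_in, n), and a binary mask `M` where 1 means "keep". With a fixed number of kept weights per row, the layer error splits into independent per-row terms, and swapping one kept weight with one pruned weight in a row leaves that row's sparsity unchanged.

```python
import numpy as np

def layer_error(W, M, X):
    """Per-layer pruning error ||W X - (M * W) X||_F^2 on calibration data X."""
    D = (W - M * W) @ X
    return float(np.sum(D ** 2))

def row_error(w, m, G):
    """Error contribution of one row, with G = X @ X.T (assumed formulation)."""
    r = w * (1.0 - m)          # the pruned part of the row
    return float(r @ G @ r)

def best_one_swap(w, m, G):
    """Find the single (kept k -> pruned, pruned p -> kept) swap that lowers
    the row error the most. Returns (delta, k, p); delta < 0 is an improvement."""
    kept = np.flatnonzero(m == 1)
    pruned = np.flatnonzero(m == 0)
    if kept.size == 0 or pruned.size == 0:
        return 0.0, -1, -1
    r = w * (1.0 - m)
    c = G @ r
    g = np.diag(G)
    wk = w[kept][:, None]      # shape (K, 1)
    wp = w[pruned][None, :]    # shape (1, P)
    # Change in r^T G r when r gains w_k at index k and loses w_p at index p.
    delta = (2.0 * (wk * c[kept][:, None] - wp * c[pruned][None, :])
             + wk ** 2 * g[kept][:, None]
             + wp ** 2 * g[pruned][None, :]
             - 2.0 * wk * wp * G[np.ix_(kept, pruned)])
    i, j = np.unravel_index(np.argmin(delta), delta.shape)
    return float(delta[i, j]), int(kept[i]), int(pruned[j])

def refine_row(w, m, G, max_swaps=100, tol=1e-9):
    """Greedily apply improving 1-swaps; the row's sparsity level never changes."""
    m = m.copy()
    for _ in range(max_swaps):
        delta, k, p = best_one_swap(w, m, G)
        if delta >= -tol:      # no swap improves the row error
            break
        m[k], m[p] = 0.0, 1.0  # prune k, restore p
    return m
```

A typical use would start from a warm-start mask (for example, per-row magnitude pruning), compute `G = X @ X.T` once, and call `refine_row` independently on every row. Because each accepted swap strictly lowers that row's error, the refined mask is never worse than the warm start under this objective.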
Peer Reviews
Decision · Submitted to ICLR 2026
1. To address the core bottlenecks in LLM pruning, this paper proposes an integrated framework that combines row decoupling, SVD-based compression, and a 1-swap strategy. This approach achieves substantial improvements over existing pruning methods such as DSnoT and Wanda. 2. This paper is well-motivated by three insights in Sec. 2. 3. The experiments report consistent and sometimes substantial improvements on both local pruning loss and downstream task metrics across multiple model families. 4.
1. The experiments in this paper are somewhat limited. Although results are provided for five LLM models, all of them are language models. It remains unclear how SparseSwaps performs on vision models or other types of Transformer architectures. This limitation constrains the generality and comprehensiveness of the evaluation. 2. The paper does not provide the runtime of SparseSwaps on different models or comparisons with baselines, which makes it difficult to evaluate the proposed method. 3. T
1. The paper correctly identifies a major practical limitation of state-of-the-art layer-wise LLM pruning methods: computational intractability. 2. This paper proposes three clever insights, namely row decoupling, SVD compression, and 1-swap optimization, that significantly reduce the problem's complexity, supported by clear mathematical analysis. 3. The paper provides compelling evidence for the effectiveness of SparseSwaps across multiple modern LLM architectures and sparsity patterns (unstructured, 2:4 N:M).
1. Constraint to Per-Row Sparsity: The first insight, which enables the method's tractability, is also its main limitation. By decoupling the rows, the algorithm cannot reallocate sparsity between different rows of a weight matrix. This restricts its ability to find a truly optimal unstructured mask at the layer level, as the sparsity budget for each row is fixed by the warm-start mask. The authors acknowledge this limitation in the conclusion. 2. Computational Overhead: While the paper argues t
The paper is well-written. The main observations on row separability, unitary invariance and SVD compression, and exact 1-swap with incremental updates are convincing and directly related to the complexity bottlenecks of pruning LLMs. Using exact 1-swap search over the true objective is a new and interesting approach compared to previous LLM pruning methods, which often optimize surrogates. The proposed method is well-motivated based on the observations, with detailed discussion on complexity and
I don't see any major weaknesses. Perhaps the authors should consider accounting for the structure within the q/k/v or MLP sub-blocks in their approach, to understand why some layers benefit more than others.
Taxonomy
Topics: Advanced Neural Network Applications · Neural Networks and Reservoir Computing · Stochastic Gradient Optimization Techniques
