SparseSwaps: Tractable LLM Pruning Mask Refinement at Scale

Max Zimmer; Christophe Roux; Moritz Wagner; Deborah Hendrych; Sebastian Pokutta

arXiv:2512.10922·cs.LG·February 3, 2026

Open Access·3 Reviews

TL;DR

SparseSwaps introduces a scalable method for refining pruning masks in large language models, reducing per-layer pruning error and improving perplexity and zero-shot accuracy without full retraining.

Contribution

It proposes a novel 1-swap algorithm that simplifies mask refinement at LLM scale, enabling efficient, hyperparameter-free pruning mask optimization.
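The 1-swap idea can be sketched for a single weight row as follows. This is a minimal illustration, not the paper's implementation. It assumes the row's pruning error can be written as d^T G d, where d holds the pruned weights and G = X X^T is the Gram matrix of the calibration inputs. The names `quad_error` and `one_swap_refine` and the numerical tolerance `tol` are hypothetical. Each candidate swap is re-evaluated from scratch here for clarity, whereas the reviews note that the paper uses incremental updates.

```python
def quad_error(w, mask, G):
    """Row pruning error d^T G d, where d keeps only the pruned weights."""
    d = [0.0 if keep else wi for wi, keep in zip(w, mask)]
    n = len(d)
    return sum(d[i] * G[i][j] * d[j] for i in range(n) for j in range(n))


def one_swap_refine(w, mask, G, tol=1e-12):
    """Greedily exchange one kept index with one pruned index while the error drops.

    The number of kept weights never changes, so the row's sparsity is preserved.
    `tol` is only a floating-point guard (an assumption of this sketch).
    """
    mask = list(mask)
    best = quad_error(w, mask, G)
    while True:
        best_pair = None
        kept = [i for i, keep in enumerate(mask) if keep]
        pruned = [i for i, keep in enumerate(mask) if not keep]
        for k in kept:
            for p in pruned:
                mask[k], mask[p] = False, True   # try the swap
                err = quad_error(w, mask, G)
                mask[k], mask[p] = True, False   # undo it
                if err < best - tol:
                    best, best_pair = err, (k, p)
        if best_pair is None:                    # no improving swap is left
            return mask, best
        k, p = best_pair
        mask[k], mask[p] = False, True           # commit the best swap


# Toy example with an identity Gram matrix and a deliberately poor warm start.
w = [1.0, 2.0, 3.0, 0.5]
G = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
start = [True, False, False, True]
refined, err = one_swap_refine(w, start, G)
```

With an identity Gram matrix the error reduces to the sum of squared pruned weights, so the refinement ends at the mask that keeps the two largest-magnitude entries. With a non-diagonal G the result can differ from magnitude-based selection.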

Findings

01 Reduces per-layer pruning error by up to 60% over previous methods.

02 Improves perplexity and zero-shot accuracy across GPT models.

03 Runs efficiently on GPUs at large scale.

Abstract

The resource requirements of neural networks can be significantly reduced through pruning - the removal of seemingly less important parameters. However, for LLMs, full retraining to recover pruning-induced performance degradation is often prohibitive and classical approaches such as magnitude pruning are suboptimal on Transformers. State-of-the-art methods hence solve a layer-wise mask selection problem: finding a pruning mask that minimizes per-layer pruning error on a small set of calibration data. Exactly solving this problem is computationally infeasible due to its combinatorial nature and the size of the search space, and existing approaches rely on approximations or heuristics. We demonstrate that the mask selection problem can be made drastically more tractable at LLM scale. To that end, we decouple the rows by enforcing equal sparsity levels per row. This allows us to derive…
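To illustrate why fixing the sparsity level per row decouples the problem, the sketch below assumes the common form of the layer-wise objective, ||W X - (M ⊙ W) X||_F^2, with weights W, binary mask M, and calibration inputs X. That form and all names (`matmul`, `layer_error`, `row_error`) are assumptions made for illustration and are not taken from the paper. The point is that the layer error is a sum of per-row terms, so once each row has its own sparsity budget, every row can be optimized independently.

```python
def matmul(A, B):
    """Plain-Python matrix product for small illustrative matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def layer_error(W, M, X):
    """||W X - (M * W) X||_F^2 computed on the whole layer at once."""
    dense = matmul(W, X)
    masked = [[w * m for w, m in zip(wr, mr)] for wr, mr in zip(W, M)]
    sparse = matmul(masked, X)
    return sum((a - b) ** 2 for ra, rb in zip(dense, sparse) for a, b in zip(ra, rb))


def row_error(w, m, X):
    """Contribution of one output row, using only that row's weights and mask."""
    pruned = [wi * (1 - mi) for wi, mi in zip(w, m)]
    resid = matmul([pruned], X)[0]
    return sum(r * r for r in resid)


W = [[1.0, -2.0, 0.5], [2.0, 3.0, 1.0]]
M = [[1, 0, 1], [0, 1, 1]]                  # two of three weights kept in every row
X = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]   # rows: input features, columns: samples

total = layer_error(W, M, X)
per_row = [row_error(w, m, X) for w, m in zip(W, M)]
```

Here `total` equals `sum(per_row)`. The rows interact only through a shared sparsity budget, which is exactly what the equal-sparsity-per-row constraint removes.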

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 4·Confidence 3

Strengths

1. To address the core bottlenecks in LLM pruning, this paper proposes an integrated framework that combines row decoupling, SVD-based compression, and a 1-swap strategy. This approach achieves substantial improvements over existing pruning methods such as DSnoT and Wanda.
2. The paper is well motivated by the three insights in Sec. 2.
3. The experiments report consistent and sometimes substantial improvements on both local pruning loss and downstream task metrics across multiple model families.
4.

Weaknesses

1. The experiments in this paper are somewhat limited. Although results are provided for five LLMs, all of them are language models. It remains unclear how SparseSwaps performs on vision models or other types of Transformer architectures. This limitation constrains the generality and comprehensiveness of the evaluation.
2. The paper does not provide the runtime of SparseSwaps on different models or comparisons with baselines, which makes it difficult to evaluate the proposed method.
3. T

Reviewer 02·Rating 6·Confidence 2

Strengths

1. The paper correctly identifies a major practical limitation of state-of-the-art layer-wise LLM pruning methods: their computational intractability.
2. The paper proposes three clever insights, namely row decoupling, SVD compression, and 1-swap optimization, that significantly reduce the problem's complexity and are supported by clear mathematical analysis.
3. The paper provides compelling evidence for the effectiveness of SparseSwaps across multiple modern LLM architectures and sparsity patterns (unstructured, 2:4 N:M).

Weaknesses

1. Constraint to Per-Row Sparsity: The first insight, which enables the method's tractability, is also its main limitation. By decoupling the rows, the algorithm cannot reallocate sparsity between different rows of a weight matrix. This restricts its ability to find a truly optimal unstructured mask at the layer level, as the sparsity budget for each row is fixed by the warm-start mask. The authors acknowledge this limitation in the conclusion.
2. Computational Overhead: While the paper argues t

Reviewer 03·Rating 8·Confidence 4

Strengths

The paper is well written. The main observations on row separability, unitary invariance and SVD compression, and exact 1-swap with incremental updates are convincing and directly related to the complexity bottlenecks of pruning LLMs. Using exact 1-swap search over the true objective is a new and interesting approach compared to previous LLM pruning methods, which often optimize surrogates. The proposed method is well motivated by the observations, with detailed discussion on complexity and

Weaknesses

I don't see any major weaknesses. Perhaps the authors should consider taking the structure within q/k/v or MLP sub-blocks into account in their approach, to understand why some layers benefit more than others.
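The unitary-invariance observation mentioned in this review can be illustrated with a small numerical check. This is a sketch under assumptions, not the paper's construction: it takes the row error to be ||d^T X||^2, with d the pruned weights of one row, and shows that right-multiplying the calibration matrix X by an orthogonal matrix leaves that error unchanged. This is the kind of property that allows X to be replaced by a smaller factor with the same Gram matrix X X^T. The helper names `row_error` and `rotate_columns` are hypothetical.

```python
import math


def row_error(w, mask, X):
    """||d^T X||^2 where d holds the pruned weights of one row."""
    d = [0.0 if keep else wi for wi, keep in zip(w, mask)]
    n_samples = len(X[0])
    resid = [sum(d[j] * X[j][t] for j in range(len(d))) for t in range(n_samples)]
    return sum(r * r for r in resid)


def rotate_columns(X, theta):
    """Right-multiply a two-column X by a 2x2 rotation, i.e. an orthogonal Q."""
    c, s = math.cos(theta), math.sin(theta)
    return [[a * c - b * s, a * s + b * c] for a, b in X]


w = [1.0, -2.0, 0.5]
mask = [True, False, False]                 # keep only the first weight
X = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]   # rows: input features, columns: samples

err_original = row_error(w, mask, X)
err_rotated = row_error(w, mask, rotate_columns(X, 0.7))
```

Both values agree up to floating-point rounding, because the error depends on X only through X X^T.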

Taxonomy

Topics: Advanced Neural Network Applications · Neural Networks and Reservoir Computing · Stochastic Gradient Optimization Techniques