SparseSSM: Efficient Selective Structured State Space Models Can Be Pruned in One-Shot
Kaiwen Tuo, Huan Wang

TL;DR
SparseSSM introduces a training-free, layer-wise pruning method for state-space language models that removes 50% of SSM weights without zero-shot accuracy loss, improving deployment efficiency for large models.
Contribution
It extends the optimal brain surgeon framework to state space models, enabling effective one-shot pruning without fine-tuning and providing insights into model redundancy.
Findings
Prunes 50% of SSM weights with no zero-shot accuracy loss.
Achieves state-of-the-art pruning performance for Mamba-based LLMs.
Supports extension to structured sparsity.
Abstract
State-space language models such as Mamba match Transformer quality while permitting linear-complexity inference, yet still comprise billions of parameters that hinder deployment. Existing one-shot pruning methods are tailored to attention blocks and fail to account for the time-shared and discretized state-transition matrix at the heart of the selective state-space module (SSM). In this paper, we introduce SparseSSM, the first training-free pruning framework that extends the classic optimal brain surgeon (OBS) framework to state-space architectures. Our layer-wise algorithm (i) derives an approximate second-order saliency score that aggregates Hessian-trace information across time steps, (ii) incorporates a component sensitivity analysis to guide feed-forward network (FFN) pruning, which also sheds light on where redundancy resides in the Mamba architecture, (iii) can be easily extended to…
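To make the second-order saliency idea in the abstract concrete, here is a minimal sketch, not the paper's algorithm. It uses the classic OBS score w_q^2 / (2 [H^-1]_qq) under a diagonal Hessian assumption, where the score reduces to 0.5 * h_qq * w_q^2 and the OBS compensation update to the remaining weights vanishes. Summing per-time-step curvature into one score for a time-shared parameter is an assumption made for illustration, and the function names and toy numbers are hypothetical.

```python
def obs_saliency_diag(weights, hess_diag_per_step):
    """OBS saliency w_q^2 / (2 * [H^-1]_qq) under a diagonal Hessian.

    For diagonal H, [H^-1]_qq = 1 / h_qq, so the score is 0.5 * h_qq * w_q^2.
    Summing curvature over time steps is an illustrative assumption, not the
    paper's exact aggregation rule.
    """
    n = len(weights)
    # Aggregate the per-step diagonal curvature of the time-shared parameter.
    h = [sum(step[q] for step in hess_diag_per_step) for q in range(n)]
    return [0.5 * h[q] * weights[q] ** 2 for q in range(n)]


def prune_one_shot(weights, saliency, sparsity):
    """Zero out the fraction `sparsity` of weights with the lowest saliency."""
    k = int(len(weights) * sparsity)
    drop = set(sorted(range(len(weights)), key=lambda q: saliency[q])[:k])
    return [0.0 if q in drop else w for q, w in enumerate(weights)]


# Toy time-shared parameter with 4 entries and curvature from 3 time steps.
weights = [0.5, -2.0, 1.0, 0.25]
hess_diag_per_step = [
    [1.0, 0.0, 2.0, 4.0],
    [1.0, 0.0, 2.0, 4.0],
    [2.0, 0.125, 0.0, 2.0],
]
scores = obs_saliency_diag(weights, hess_diag_per_step)
pruned = prune_one_shot(weights, scores, sparsity=0.5)
print(scores)  # [0.5, 0.25, 2.0, 0.3125]
print(pruned)  # [0.5, 0.0, 1.0, 0.0]
```

In this toy case the largest-magnitude weight (-2.0) is removed because its aggregated curvature is small, which is where a second-order criterion departs from plain magnitude pruning.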
Peer Reviews
Decision · Submitted to ICLR 2026
This paper clearly presents its method and provides comprehensive numerical results across multiple Mamba checkpoints, demonstrating consistent performance retention under high sparsity levels. The proposed SparseSSM method is conceptually sound, extending the Optimal Brain Surgeon framework to selective state-space models in a theoretically grounded way. The layer-wise Hessian-based importance estimation and time-weighted aggregation strategy are well-motivated and novel in the context of SSMs.
While the proposed SparseSSM method is well-motivated and empirically effective, several important aspects remain insufficiently addressed. First, since Mamba and other state-space models already benefit from efficient recurrent inference, it is unclear whether pruning further improves real-world inference speed or hardware efficiency. The paper does not provide detailed runtime, memory, or energy analyses to substantiate the claimed computational benefits. Second, evaluating performance solely
- The idea is clear and is executed well. The authors also point out the limitations/challenges of applying it more broadly to the non-SSM portions of the architecture.
- The method is training-free, making it tractable to apply to pretrained models in an inexpensive way.
- The chosen baselines and experiments make sense. The results are generally good when compared to the chosen baselines on the tested datasets. The paper is also technically clear and easy to read and understand.
- The work
- This paper (the way it is written) has a fundamental flaw. The core motivation states that Mamba (1 and 2) models have billions of parameters, which hinders deployment. However, more than 95% of the parameter count lies in the in_proj and out_proj layers, which are just fully connected projection layers. The paper instead focuses on the SSM layers, which (especially in Mamba2) have much less impact on overall efficiency (in both parallel scan and recurrent inference modes). At the very lea
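The parameter-share point in this review can be checked with a rough count. The sketch below assumes the layout of the reference Mamba-1 block (in_proj, depthwise conv1d, x_proj, dt_proj, A_log, D, out_proj) with its usual defaults (expand=2, d_state=16, d_conv=4, dt_rank=ceil(d_model/16)). The function name and the choice of d_model=2560 are illustrative assumptions, and embeddings and normalization layers are left out.

```python
import math


def mamba1_block_params(d_model, expand=2, d_state=16, d_conv=4):
    """Approximate per-block parameter counts for an assumed Mamba-1 layout."""
    d_inner = expand * d_model
    dt_rank = math.ceil(d_model / 16)
    return {
        "in_proj": d_model * 2 * d_inner,             # no bias
        "out_proj": d_inner * d_model,                # no bias
        "conv1d": d_inner * d_conv + d_inner,         # depthwise kernel + bias
        "x_proj": d_inner * (dt_rank + 2 * d_state),  # no bias
        "dt_proj": dt_rank * d_inner + d_inner,       # weight + bias
        "A_log": d_inner * d_state,                   # state-transition parameter
        "D": d_inner,                                 # skip connection
    }


counts = mamba1_block_params(d_model=2560)
total = sum(counts.values())
proj_share = (counts["in_proj"] + counts["out_proj"]) / total
a_share = counts["A_log"] / total
print(f"in_proj + out_proj: {proj_share:.1%} of block parameters")
print(f"A_log: {a_share:.2%} of block parameters")
```

Under these assumptions the two projections hold roughly 95% of each block's parameters, while the state-transition parameter accounts for well under 1%, which is consistent with the reviewer's concern.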
Originality: The paper introduces an extension to the OBS pruning framework tailored to Mamba-based state-space models, addressing architectural challenges (such as time-shared and discretized parameters) not considered in prior work.
Quality: The approach has rigorous theoretical derivations and an empirical evaluation spanning different model sizes, pruning granularities, and both language modeling and reasoning benchmarks. The inclusion of ablation studies and sensitivity analyses validates the
1. Limited Baseline Comparison: The experimental evaluation primarily contrasts the proposed method with magnitude pruning, SparseGPT, and Mamba-Shedder. However, recent pruning approaches such as Wanda are not included. Further evaluation would clarify whether the paper’s architectural adaptations are essential and show real empirical advantages.
2. Experimental Scope and Generalization: Although multiple scales of Mamba models are evaluated, the experiments are limited to Mamba architecture a
Taxonomy
Topics: EEG and Brain-Computer Interfaces · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis
Methods: Linear Layer · Adam · Byte Pair Encoding · Attention Is All You Need · Mamba: Linear-Time Sequence Modeling with Selective State Spaces · Multi-Head Attention · Dropout · Label Smoothing · Dense Connections · Residual Connection
