SparseForge: Efficient Semi-Structured LLM Sparsification via Annealing of Hessian-Guided Soft-Mask

Liu Hanzuo; Chaofan Lin; Weixuan Sun; Yulong Wang; Key; Rayying; Mingyu Gao

arXiv:2605.06402·cs.LG·May 8, 2026

TL;DR

SparseForge is a post-training sparsification method for large language models that recovers accuracy by optimizing the sparsity mask directly, using Hessian-aware importance estimation and progressive soft-mask annealing, instead of scaling up retraining tokens.

Contribution

It introduces a Hessian-aware soft-mask annealing technique for semi-structured LLM sparsification that requires significantly fewer retraining tokens than large-scale sparse retraining.

Findings

01

Achieves 57.27% zero-shot accuracy on LLaMA-2-7B at 2:4 sparsity with only 5B retraining tokens.

02

Surpasses the dense model's accuracy (56.43%) and approaches state-of-the-art sparsification methods that retrain on 40B tokens.

03

Demonstrates consistent accuracy-efficiency improvements across different model families.

Abstract

Semi-structured sparsity provides a practical path to accelerate large language models (LLMs) with native hardware support, but post-training semi-structured pruning often suffers from substantial quality degradation due to strong structural coupling. Existing methods rely on large-scale sparse retraining to recover accuracy, resulting in high computational cost. We propose SparseForge, a post-training framework that improves recovery efficiency by directly optimizing the sparsity mask rather than scaling up retraining tokens. SparseForge combines Hessian-aware importance estimation with progressive annealing of soft masks into hardware-executable structured sparsity, enabling stable and efficient sparse recovery. On LLaMA-2-7B under 2:4 sparsity, SparseForge achieves 57.27% average zero-shot accuracy with only 5B retraining tokens, surpassing the dense model's 56.43%…
