SlideSparse: Fast and Flexible (2N-2):2N Structured Sparsity

Hanyong Shao; Yingbo Hao; Ting Song; Yan Xia; Di Zhang; Shaohan Huang; Xun Wu; Songchen Xu; Le Xu; Li Dong; Zewen Chi; Yi Zou; Furu Wei

arXiv:2603.05232·cs.LG·March 6, 2026

TL;DR

SlideSparse is a system that runs $(2N-2):2N$ structured-sparse large language model weights on the 2:4 Sparse Tensor Cores of commodity GPUs, reaching near-theoretical speedups while the decomposition itself introduces no accuracy loss.

Contribution

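The paper presents Sliding Window Decomposition, a hardware-aware weight decomposition, and Activation Lifting, an activation-fusion technique. Together they unlock Sparse Tensor Core acceleration for $(2N-2):2N$ sparsity patterns, which previously fell back to dense execution on standard GPUs.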

Findings

01

Achieves 1.33x speedup on compute-bound workloads.

02

Approaches the theoretical upper-bound speedup of 4/3 at 6:8 sparsity.

03

Demonstrates effectiveness across multiple GPU architectures and model families.
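
The 4/3 bound quoted in finding 02 can be recovered with a short back-of-the-envelope calculation. This is a sketch under stated assumptions, not the paper's own derivation: each of the $N-1$ windows is assumed to hold 4 elements, Sparse Tensor Cores are assumed to run 2:4 data at exactly 2x dense throughput, and the cost of rearranging activations is ignored.

$$
\text{speedup}_{\max}
= \frac{\text{dense work}}{\text{sparse work}}
= \frac{2N}{\tfrac{1}{2}\cdot 4(N-1)}
= \frac{2N}{2N-2}
= \frac{N}{N-1},
\qquad
N = 4 \;\Rightarrow\; \frac{4}{3} \approx 1.33 .
$$

Under the same assumptions, milder patterns (larger $N$) have a bound closer to 1, so the gain shrinks as less of the weight block is pruned.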

Abstract

NVIDIA's 2:4 Sparse Tensor Cores deliver 2x throughput but demand strict 50% pruning -- a ratio that collapses LLM reasoning accuracy (Qwen3: 54% to 15%). Milder $(2 N - 2) : 2 N$ patterns (e.g., 6:8, 25% pruning) preserve accuracy yet receive no hardware support, falling back to dense execution without any benefit from sparsity. We present SlideSparse, the first system to unlock Sparse Tensor Core acceleration for the $(2 N - 2) : 2 N$ model family on commodity GPUs. Our Sliding Window Decomposition reconstructs any $(2 N - 2) : 2 N$ weight block into $N - 1$ overlapping 2:4-compliant windows without any accuracy loss; Activation Lifting fuses the corresponding activation rearrangement into per-token quantization at near-zero cost. Integrated into vLLM, SlideSparse is evaluated across various GPUs (A100, H100, B200, RTX 4090, RTX 5080, DGX-spark), precisions (FP4, INT8, FP8, BF16, FP16), and model…
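
The decomposition described in the abstract can be illustrated with a small NumPy sketch. It is not the paper's implementation: the stride-2 window layout, the greedy assignment rule, and the names `slide_decompose` and `lift_activations` are assumptions made for illustration. The sketch only shows that a block with at most $2N-2$ nonzeros can be split into $N-1$ overlapping length-4 windows with at most 2 nonzeros each, and that the dot product with correspondingly gathered activations is unchanged.

```python
import numpy as np


def slide_decompose(block):
    """Split a (2N-2):2N block into N-1 length-4 windows with <= 2 nonzeros each.

    Assumption (not from the paper): window w covers positions 2w .. 2w+3,
    and each nonzero goes to the earliest covering window with spare capacity.
    """
    block = np.asarray(block, dtype=float)
    if block.size % 2 or block.size < 4:
        raise ValueError("block length must be an even number >= 4")
    n_windows = block.size // 2 - 1              # N - 1
    windows = np.zeros((n_windows, 4))
    used = [0] * n_windows                       # nonzeros placed per window
    for pos in np.flatnonzero(block):
        pos = int(pos)
        pair = pos // 2
        for w in (pair - 1, pair):               # the two windows covering pos
            if 0 <= w < n_windows and used[w] < 2:
                windows[w, pos - 2 * w] = block[pos]
                used[w] += 1
                break
        else:
            raise ValueError("block has more than 2N-2 nonzeros")
    return windows


def lift_activations(x):
    """Gather activations into the same stride-2, length-4 window layout."""
    x = np.asarray(x, dtype=float)
    n_windows = x.size // 2 - 1
    idx = 2 * np.arange(n_windows)[:, None] + np.arange(4)[None, :]
    return x[idx]                                # overlapping entries are duplicated


# Usage: a 6:8 block (N = 4) becomes 3 windows, each 2:4-compliant.
w = np.array([0.5, -1.0, 0.0, 2.0, 3.0, 0.0, -0.25, 1.5])
x = np.arange(1.0, 9.0)
windows = slide_decompose(w)
assert (np.count_nonzero(windows, axis=1) <= 2).all()
assert np.isclose((windows * lift_activations(x)).sum(), w @ x)
```

In this sketch the gather in `lift_activations` is a separate step; the abstract states that SlideSparse instead fuses the activation rearrangement into per-token quantization, which this example does not attempt to model.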

Taxonomy

Topics: Tensor decomposition and applications · Parallel Computing and Optimization Techniques · Advanced Neural Network Applications