SlideSparse: Fast and Flexible (2N-2):2N Structured Sparsity
Hanyong Shao, Yingbo Hao, Ting Song, Yan Xia, Di Zhang, Shaohan Huang, Xun Wu, Songchen Xu, Le Xu, Li Dong, Zewen Chi, Yi Zou, Furu Wei

TL;DR
SlideSparse introduces a novel system that enables efficient acceleration of $(2N-2):2N$ structured sparsity patterns in large language models on commodity GPUs, achieving near-theoretical speedups without accuracy loss.
Contribution
It presents Sliding Window Decomposition, a hardware-aware weight decomposition, and Activation Lifting, an activation fusion technique, which together unlock acceleration for $(2N-2):2N$ sparsity patterns, previously unsupported on standard GPUs.
Findings
Achieves a 1.33x speedup on compute-bound workloads at 6:8 sparsity.
This approaches the theoretical upper-bound speedup of 4/3 (about 1.33x) for that pattern.
Demonstrates effectiveness across multiple GPU architectures and model families.
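The 4/3 bound can be recovered with a short counting argument. This is a sketch under the assumption that an ideal sparse kernel spends time proportional only to the weights that are kept; the symbol $S(N)$ is introduced here for illustration and is not taken from the paper. A $(2N-2):2N$ pattern keeps $2N-2$ of every $2N$ weights, so:

$$
S(N) = \frac{2N}{2N-2} = \frac{N}{N-1}, \qquad S(4) = \frac{4}{3} \approx 1.33 \quad \text{(6:8 sparsity)}
$$

As a consistency check, $S(2) = 2$ for the 2:4 pattern, which matches the 2x throughput of Sparse Tensor Cores quoted in the abstract.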
Abstract
NVIDIA's 2:4 Sparse Tensor Cores deliver 2x throughput but demand strict 50% pruning, a ratio that collapses LLM reasoning accuracy (Qwen3: 54% to 15%). Milder patterns (e.g., 6:8, 25% pruning) preserve accuracy yet receive no hardware support, falling back to dense execution without any benefit from sparsity. We present SlideSparse, the first system to unlock Sparse Tensor Core acceleration for the $(2N-2):2N$ model family on commodity GPUs. Our Sliding Window Decomposition reconstructs any $(2N-2):2N$ weight block into overlapping 2:4-compliant windows without any accuracy loss; Activation Lifting fuses the corresponding activation rearrangement into per-token quantization at near-zero cost. Integrated into vLLM, SlideSparse is evaluated across various GPUs (A100, H100, B200, RTX 4090, RTX 5080, DGX Spark), precisions (FP4, INT8, FP8, BF16, FP16), and model…
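To make the idea of overlapping 2:4-compliant windows concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. The window layout (4-wide windows at stride 2), the greedy left-to-right assignment, and the names `sliding_window_decompose` and `lift_activations` are all assumptions chosen for illustration. The sketch shows that a block of length $2N$ with at least two zeros can be split so that every window holds at most two nonzeros, while the dot product against correspondingly rearranged activations is unchanged.

```python
def sliding_window_decompose(block):
    """Split a block of length 2N (with at least two zeros) into N-1
    overlapping 4-wide windows at stride 2, each holding <= 2 nonzeros.

    Assumed layout for illustration: window i covers positions 2i .. 2i+3.
    """
    if len(block) < 4 or len(block) % 2 != 0:
        raise ValueError("block length must be even and at least 4")
    n_windows = len(block) // 2 - 1
    if sum(1 for w in block if w != 0) > 2 * n_windows:
        raise ValueError("block must contain at least two zeros")

    windows = [[0] * 4 for _ in range(n_windows)]
    used = [0] * n_windows  # nonzeros already placed in each window

    for pos, w in enumerate(block):
        if w == 0:
            continue
        pair = pos // 2
        # Position pos is covered by windows pair-1 and pair; prefer the left one.
        for i in (pair - 1, pair):
            if 0 <= i < n_windows and used[i] < 2:
                windows[i][pos - 2 * i] = w
                used[i] += 1
                break
        else:
            raise AssertionError("no window had room; input check should prevent this")
    return windows


def lift_activations(x):
    """Rearrange activations into the same overlapping stride-2 windows."""
    n_windows = len(x) // 2 - 1
    lifted = []
    for i in range(n_windows):
        lifted.extend(x[2 * i : 2 * i + 4])
    return lifted


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


# 6:8 example (N = 4): eight weights, two of them zero.
weights = [1, 2, 3, 0, 4, 5, 0, 6]
acts = [10, 20, 30, 40, 50, 60, 70, 80]

windows = sliding_window_decompose(weights)
flat_windows = [w for win in windows for w in win]
lifted = lift_activations(acts)

# The decomposed product equals the original dense product.
assert dot(flat_windows, lifted) == dot(weights, acts)
```

In this example the lifted activation vector has 12 entries instead of 8. That duplication is the rearrangement cost which, according to the abstract, Activation Lifting folds into per-token quantization.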
Taxonomy
Topics: Tensor decomposition and applications · Parallel Computing and Optimization Techniques · Advanced Neural Network Applications
