Enabling Dynamic Sparsity in Quantized LLM Inference

Rongxiang Wang; Kangyuan Shu; Felix Xiaozhu Lin

arXiv:2511.04477·cs.DC·November 7, 2025

TL;DR

This paper introduces techniques that enable dynamic sparsity in quantized large language model inference, improving decoding throughput by up to 1.55x on resource-constrained hardware while keeping accuracy comparable to dense quantized inference.

Contribution

The paper proposes a method that combines dynamic sparsity with low-bit quantization through three components: a zigzag quantization layout, a specialized GEMV kernel, and an efficient runtime mechanism.

Findings

01. Achieves up to 1.55x faster decoding throughput.

02. Maintains accuracy comparable to dense quantized inference.

03. Demonstrates effective coexistence of structured sparsity and quantization on GPUs.

Abstract

Deploying large language models (LLMs) on end-user devices is gaining importance due to benefits in responsiveness, privacy, and operational cost. Yet the limited memory and compute capability of mobile and desktop GPUs make efficient execution difficult. Recent observations suggest that the internal activations of LLMs are often dynamically sparse, meaning that for each input, only part of the network contributes significantly to the output. Such sparsity could reduce computation, but it interacts poorly with group-wise quantization, which remains the dominant approach for fitting LLMs onto resource-constrained hardware. To reconcile these two properties, this study proposes a set of techniques that realize dynamic sparse inference under low-bit quantization. The method features: (1) a zigzag-patterned quantization layout that organizes weights in a way consistent with activation…
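
To make the interaction between dynamic sparsity and group-wise quantization more concrete, here is a minimal pure-Python sketch. It is not the paper's zigzag layout or its GEMV kernel. It assumes a hypothetical layout in which each input channel's weights are stored as one column and quantization groups run along that column, so an inactive input skips whole groups. The names `quantize_column`, `sparse_gemv`, `group_size`, and `threshold` are illustrative only.

```python
def quantize_column(col, group_size=4, bits=4):
    """Symmetric group-wise quantization of one weight column.

    Returns a list of (scale, quantized_ints) pairs, one per group.
    """
    qmax = 2 ** (bits - 1) - 1
    groups = []
    for start in range(0, len(col), group_size):
        chunk = col[start:start + group_size]
        # Fall back to scale 1.0 when the whole group is zero.
        scale = max(abs(v) for v in chunk) / qmax or 1.0
        groups.append((scale, [round(v / scale) for v in chunk]))
    return groups


def sparse_gemv(qcols, x, threshold):
    """Compute y = W @ x from quantized columns, skipping small activations.

    qcols[j] holds the quantized groups of input channel j (an assumed layout).
    Returns the output vector and the number of columns actually read.
    """
    n_out = sum(len(q) for _, q in qcols[0])
    y = [0.0] * n_out
    touched = 0
    for j, xj in enumerate(x):
        if abs(xj) < threshold:
            continue  # inactive input: none of this column's groups are read
        touched += 1
        row = 0
        for scale, q in qcols[j]:
            for qv in q:
                y[row] += scale * qv * xj  # dequantize on the fly
                row += 1
    return y, touched


# Three input channels, four outputs; the middle activation is exactly zero.
columns = [
    [0.5, -1.0, 0.25, 0.75],
    [1.0, 1.0, -1.0, -1.0],
    [0.1, 0.2, 0.3, 0.4],
]
x = [2.0, 0.0, -1.0]
qcols = [quantize_column(c, group_size=2) for c in columns]

y_dense, n_dense = sparse_gemv(qcols, x, threshold=0.0)    # reads every column
y_sparse, n_sparse = sparse_gemv(qcols, x, threshold=1e-3)  # skips the zero input
```

In this toy layout a skipped activation never touches a scale or a packed weight. If the groups instead spanned several input channels, skipping one channel would still require loading the shared scale and the rest of the group, which is one way to picture why the two techniques can interact poorly.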

Taxonomy

Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Natural Language Processing Techniques