Polar Sparsity: High Throughput Batched LLM Inferencing with Scalable Contextual Sparsity
Susav Shrestha, Brad Settlemyer, Nikoli Dryden, Narasimha Reddy

TL;DR
This paper introduces Polar Sparsity, an approach that combines stable attention-head sparsity with hardware-efficient GPU kernels to speed up batched large language model inference by up to 2.2x while maintaining accuracy.
Contribution
Polar Sparsity scales contextual sparsity to large batch sizes by shifting the focus from MLP sparsity to attention-head sparsity and by providing sparsity-aware GPU kernels.
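To make attention-head sparsity concrete, here is a minimal pure-Python sketch of per-sequence head selection. It is an illustration under stated assumptions, not the paper's kernel: the names `top_k_heads` and `selective_head_attention` and the `router_scores` input are hypothetical, and having skipped heads contribute zeros is an assumption made for this example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_heads(router_scores, k):
    """Return the indices of the k highest-scoring heads, in ascending order.

    `router_scores` is a hypothetical per-head importance score for one sequence.
    """
    ranked = sorted(range(len(router_scores)),
                    key=lambda h: router_scores[h], reverse=True)
    return sorted(ranked[:k])


def head_attention(q, keys, values):
    """Scaled dot-product attention for one query vector within one head."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([scale * sum(a * b for a, b in zip(q, key)) for key in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def selective_head_attention(queries, keys, values, router_scores, k):
    """Compute attention only for the top-k heads of one sequence.

    queries[h] is the query vector of head h; keys[h] and values[h] are the
    cached key and value vectors of head h. Skipped heads return zeros
    (an assumption of this sketch).
    """
    active = set(top_k_heads(router_scores, k))
    outputs = []
    for h, q in enumerate(queries):
        if h in active:
            outputs.append(head_attention(q, keys[h], values[h]))
        else:
            outputs.append([0.0] * len(values[h][0]))
    return outputs
```

Because each sequence selects its own heads, the attention work per sequence in this sketch is a fixed fraction k/H of the dense cost regardless of how many sequences share the batch.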
Findings
Achieves up to 2.2x end-to-end speedup in batched LLM inference.
Demonstrates scalability of contextual sparsity to large batch sizes.
Maintains model accuracy while accelerating inference.
Abstract
Accelerating large language model (LLM) inference is critical for real-world deployments requiring high throughput and low latency. Contextual sparsity, where each token dynamically activates only a small subset of the model parameters, shows promise but does not scale to large batch sizes because the union of active neurons quickly approaches dense computation. We introduce Polar Sparsity, highlighting a key shift in sparsity importance from MLP to Attention layers as we scale batch size and sequence length. While MLP layers become more compute-efficient under batching, their sparsity vanishes. In contrast, attention becomes increasingly expensive at scale, while its head sparsity remains stable and batch-invariant. We develop Selective Head Attention with hardware-efficient, sparsity-aware GPU kernels, delivering up to \(2.2\times\) end-to-end speedups for models like OPT, LLaMA-2…
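The claim that the union of active neurons approaches dense computation can be illustrated with a small calculation. The sketch below assumes, purely for illustration, that every token activates a fixed fraction p of neurons chosen uniformly and independently; real activations are correlated, so measured unions will differ. The function names and the default values are hypothetical.

```python
import random


def expected_union_fraction(p, batch_size):
    """Expected fraction of neurons active for at least one token in the batch.

    Assumes each token activates each neuron independently with probability p
    (an illustrative assumption, not a measurement).
    """
    return 1.0 - (1.0 - p) ** batch_size


def simulated_union_fraction(p, batch_size, num_neurons=4096, seed=0):
    """Monte Carlo check: each token activates round(p * num_neurons) random neurons."""
    rng = random.Random(seed)
    per_token = round(p * num_neurons)
    active = set()
    for _ in range(batch_size):
        active.update(rng.sample(range(num_neurons), per_token))
    return len(active) / num_neurons


if __name__ == "__main__":
    # Print the analytic and simulated union fractions for growing batch sizes.
    for batch_size in (1, 8, 32, 128):
        print(batch_size,
              round(expected_union_fraction(0.1, batch_size), 3),
              round(simulated_union_fraction(0.1, batch_size), 3))
```

Under this independence assumption the union fraction is 1 - (1 - p)^B for batch size B, so even with 10% per-token sparsity the batch as a whole touches nearly every neuron once B reaches a few dozen.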
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning and Algorithms
Methods: Softmax · Attention Is All You Need · OPT
