Polar Sparsity: High Throughput Batched LLM Inferencing with Scalable Contextual Sparsity

Susav Shrestha; Brad Settlemyer; Nikoli Dryden; Narasimha Reddy

arXiv:2505.14884 · cs.LG · November 13, 2025

TL;DR

Polar Sparsity exploits attention-head sparsity, which stays stable as batch size grows, together with sparsity-aware GPU kernels to speed up batched large language model inference by up to 2.2x while maintaining accuracy.

Contribution

The paper presents Polar Sparsity, a method that scales contextual sparsity to large batch sizes by shifting the focus from MLP sparsity to attention-head sparsity and by developing specialized GPU kernels.
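
To make the idea of attention-head sparsity in a batch concrete, here is a minimal pure-Python sketch of one decode step in which each sequence computes attention only for its own top-k heads. This is not the paper's GPU kernel: the per-head `head_scores` input, the top-k selection rule, and all function names are assumptions made for illustration, and skipped heads simply output zeros.

```python
import math


def top_k_heads(scores, k):
    """Return the indices of the k highest-scoring heads (hypothetical selection rule)."""
    ranked = sorted(range(len(scores)), key=lambda h: scores[h], reverse=True)
    return sorted(ranked[:k])


def attend(q, keys, values):
    """Scaled dot-product attention for a single query vector and a single head."""
    scale = math.sqrt(len(q))
    logits = [sum(qi * ki for qi, ki in zip(q, key)) / scale for key in keys]
    peak = max(logits)  # subtract the max for a numerically stable softmax
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


def selective_head_attention(q, keys, values, head_scores, k):
    """Compute attention only for each sequence's top-k heads.

    q[b][h]           query vector for sequence b, head h
    keys[b][h]        list of cached key vectors for that head
    values[b][h]      list of cached value vectors for that head
    head_scores[b]    one importance score per head (assumed to be given)
    """
    outputs = []
    for b in range(len(q)):
        active = set(top_k_heads(head_scores[b], k))
        row = []
        for h in range(len(q[b])):
            if h in active:
                row.append(attend(q[b][h], keys[b][h], values[b][h]))
            else:
                # Inactive heads are skipped entirely; their output is zero.
                row.append([0.0] * len(values[b][h][0]))
        outputs.append(row)
    return outputs
```

Because every sequence keeps exactly k heads regardless of how many sequences are in the batch, the amount of attention work skipped per sequence does not shrink as the batch grows, which is the property that makes head sparsity attractive at scale.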

Findings

01 Achieves up to 2.2x end-to-end speedup in LLM inference.

02 Demonstrates that contextual sparsity can scale to large batch sizes.

03 Maintains model accuracy while accelerating inference.

Abstract

Accelerating large language model (LLM) inference is critical for real-world deployments requiring high throughput and low latency. Contextual sparsity, where each token dynamically activates only a small subset of the model parameters, shows promise but does not scale to large batch sizes because the union of active neurons quickly approaches dense computation. We introduce Polar Sparsity, highlighting a key shift in sparsity importance from MLP to attention layers as we scale batch size and sequence length. While MLP layers become more compute-efficient under batching, their sparsity vanishes. In contrast, attention becomes increasingly expensive at scale, while its head sparsity remains stable and batch-invariant. We develop Selective Head Attention with hardware-efficient, sparsity-aware GPU kernels, delivering up to \(2.2\times\) end-to-end speedups for models like OPT, LLaMA-2…
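
The claim that the union of active neurons approaches dense computation can be checked with a small back-of-the-envelope sketch. It assumes, purely for illustration, that each token activates a fixed fraction `p` of neurons independently of the other tokens in the batch; under that assumption the expected active fraction for a batch of size B is 1 - (1 - p)^B. Real activations are correlated, so actual overlap may be higher and the union smaller than this estimate. The function names and the 4096-neuron layer width are hypothetical.

```python
import random


def expected_union_density(p, batch_size):
    """Expected fraction of neurons active for at least one token in the batch,
    assuming each token activates each neuron independently with probability p."""
    return 1.0 - (1.0 - p) ** batch_size


def simulated_union_density(p, batch_size, num_neurons=4096, seed=0):
    """Monte Carlo check: each token activates a random subset of neurons,
    and the batch must compute the union of those subsets."""
    rng = random.Random(seed)
    per_token = round(p * num_neurons)
    active = set()
    for _ in range(batch_size):
        active.update(rng.sample(range(num_neurons), per_token))
    return len(active) / num_neurons


if __name__ == "__main__":
    # Per-token sparsity of 90% (p = 0.1) at increasing batch sizes.
    for batch_size in (1, 4, 16, 64):
        analytic = expected_union_density(0.1, batch_size)
        simulated = simulated_union_density(0.1, batch_size)
        print(f"batch={batch_size:3d}  analytic={analytic:.3f}  simulated={simulated:.3f}")
```

Under this independence assumption, even a layer that is 90% sparse per token ends up almost fully dense once the batch reaches a few dozen tokens, which is why neuron-level MLP sparsity stops paying off under batching.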

Code & Models

Repositories

susavlsh10/polar-sparsity · jax · Official

Models

Susav/PolarSparsity · Hugging Face model

Videos

Polar Sparsity: High Throughput Batched LLM Inferencing with Scalable Contextual Sparsity · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning and Algorithms

Methods: Softmax · Attention Is All You Need · OPT