S-HPLB: Efficient LLM Attention Serving via Sparsity-Aware Head Parallelism Load Balance

Di Liu; Yifei Liu; Chen Chen; Zhibin Yu; Xiaoyi Fan; Quan Chen; Minyi Guo

arXiv:2603.10353·cs.DC·March 12, 2026

Open Access

TL;DR

This paper introduces S-HPLB, a system that speeds up LLM attention serving by giving each attention head its own sparsity budget and balancing the resulting uneven per-head workloads across GPUs, with a reported 2.88x reduction in attention computation latency.

Contribution

The paper proposes a sparsity-aware load-balancing strategy for head-parallel attention that allocates computation resources according to each head's sparsity.

Findings

01. Achieves 2.88x reduction in attention computation latency

02. Maintains high inference quality despite sparsity optimization

03. Effectively balances load across GPUs in long-context benchmarks

Abstract

With the growing size of Large Language Models (LLMs) and their expanding context lengths, attention computation has become a key performance bottleneck in LLM serving. To speed up attention, recent practice parallelizes the attention heads across multiple GPUs and widely adopts attention sparsification, which computes only a selected subset of attention pairs under a preset sparsity budget. In this paper, we observe that the attention heads of an LLM often exhibit heterogeneous-yet-stable sparsity elasticities, which motivates us to enforce head-adaptive sparsity budgets to attain better efficiency while preserving high inference quality. Yet, from the system perspective, heterogeneous sparsity levels make attention computation time inconsistent across heads, yielding cross-GPU resource bubbles under head-parallel…
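The abstract is cut off before it describes how S-HPLB removes these bubbles, so the following is not the paper's algorithm. It is only a minimal sketch of the problem the abstract states, under stated assumptions: each head's attention cost is taken to be proportional to its sparsity budget (the integer costs below are made up for illustration), and the balancing step is the generic longest-processing-time-first (LPT) greedy heuristic. The names `contiguous_loads` and `lpt_loads` are hypothetical.

```python
import heapq


def contiguous_loads(costs, num_gpus):
    """Plain head parallelism: split heads into equal contiguous blocks.

    Returns the total cost placed on each GPU. Assumes len(costs) is
    divisible by num_gpus, as is typical for head-parallel layouts.
    """
    per_gpu = len(costs) // num_gpus
    loads = [0] * num_gpus
    for head, cost in enumerate(costs):
        loads[head // per_gpu] += cost
    return loads


def lpt_loads(costs, num_gpus):
    """Generic LPT greedy: place the costliest remaining head on the
    currently least-loaded GPU. Illustrative stand-in, not S-HPLB itself.

    Returns (placement, loads), where placement maps head index -> GPU.
    """
    # Min-heap of (current load, gpu id); ties break on the lower gpu id.
    heap = [(0, gpu) for gpu in range(num_gpus)]
    heapq.heapify(heap)
    placement = {}
    for head in sorted(range(len(costs)), key=lambda h: -costs[h]):
        load, gpu = heapq.heappop(heap)
        placement[head] = gpu
        heapq.heappush(heap, (load + costs[head], gpu))
    loads = [0] * num_gpus
    for head, gpu in placement.items():
        loads[gpu] += costs[head]
    return placement, loads


if __name__ == "__main__":
    # Hypothetical per-head costs: the first four heads tolerate little
    # sparsification (large budgets), the last four tolerate a lot.
    head_costs = [9, 8, 7, 6, 2, 1, 1, 1]
    naive = contiguous_loads(head_costs, num_gpus=2)
    _, balanced = lpt_loads(head_costs, num_gpus=2)
    # The slowest GPU sets the step time; the gap to the fastest GPU is
    # the idle "bubble" the abstract refers to.
    print("contiguous:", naive, "bubble:", max(naive) - min(naive))
    print("LPT greedy:", balanced, "bubble:", max(balanced) - min(balanced))
```

Because every GPU must finish before the attention layer can proceed, the step time is set by the most heavily loaded GPU. Spreading the dense heads across devices shrinks that maximum without changing the total amount of work.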

Taxonomy

Topics: Advanced Neural Network Applications · Big Data and Digital Economy · Machine Learning in Healthcare