S-HPLB: Efficient LLM Attention Serving via Sparsity-Aware Head Parallelism Load Balance
Di Liu, Yifei Liu, Chen Chen, Zhibin Yu, Xiaoyi Fan, Quan Chen, Minyi Guo

TL;DR
This paper introduces S-HPLB, a system for LLM attention serving that gives each attention head its own sparsity budget and balances the resulting uneven per-head workload across GPUs under head parallelism, reporting a 2.88x reduction in attention computation latency.
Contribution
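The paper proposes a sparsity-aware load balancing strategy for head-parallel attention: computation resources are allocated to attention heads according to each head's sparsity, so that heads with different sparsity levels do not leave some GPUs idle while others are still computing.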
Findings
Achieves a 2.88x reduction in attention computation latency
Maintains high inference quality despite sparsity optimization
Effectively balances load across GPUs in long-context benchmarks
Abstract
With the growing size of Large Language Models (LLMs) and their expanding context lengths, attention computation has become a key performance bottleneck in LLM serving. To speed up attention computation, recent practice often parallelizes the attention heads across multiple GPUs and widely adopts attention sparsification, which computes only a selected subset of attention pairs under a preset sparsity budget. In this paper, we observe that the attention heads of an LLM often exhibit heterogeneous-yet-stable sparsity elasticities, which motivates us to enforce head-adaptive sparsity budgets to attain better efficiency while preserving high inference quality. Yet, from the system perspective, with heterogeneous sparsity levels, attention computation time on different heads would be inconsistent, yielding cross-GPU resource bubbles under head-parallel…
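
The load imbalance described in the abstract can be illustrated with a small sketch. The excerpt above does not show the paper's actual balancing algorithm, so the code below swaps in a standard greedy heuristic (longest-processing-time-first) as a stand-in. It also assumes, purely for illustration, that a head's attention time is proportional to its sparsity budget; the function names and the budget values are hypothetical.

```python
import heapq

def contiguous_split(costs, num_gpus):
    """Baseline head parallelism: equal-sized contiguous blocks of heads per GPU.

    Assumes len(costs) is divisible by num_gpus (illustrative simplification).
    """
    assert len(costs) % num_gpus == 0
    per_gpu = len(costs) // num_gpus
    return [list(range(g * per_gpu, (g + 1) * per_gpu)) for g in range(num_gpus)]

def lpt_balance(costs, num_gpus):
    """Greedy LPT heuristic (not the paper's method): take heads in descending
    cost order and place each one on the currently least-loaded GPU."""
    heap = [(0, g) for g in range(num_gpus)]  # (current load, gpu index)
    heapq.heapify(heap)
    groups = [[] for _ in range(num_gpus)]
    for head in sorted(range(len(costs)), key=lambda i: -costs[i]):
        load, g = heapq.heappop(heap)
        groups[g].append(head)
        heapq.heappush(heap, (load + costs[head], g))
    return groups

def makespan(costs, groups):
    """Time of the slowest GPU; the other GPUs idle (a 'bubble') until it finishes."""
    return max(sum(costs[h] for h in group) for group in groups)

# Hypothetical per-head sparsity budgets, in percent of attention pairs kept.
# Cost model (an assumption): compute time is proportional to the budget.
budgets_pct = [90, 80, 70, 60, 20, 20, 10, 10]

naive = contiguous_split(budgets_pct, num_gpus=2)
balanced = lpt_balance(budgets_pct, num_gpus=2)

print(makespan(budgets_pct, naive))     # 300: GPU 0 holds all the dense heads
print(makespan(budgets_pct, balanced))  # 180: both GPUs carry equal load
```

With these made-up budgets, the contiguous split leaves one GPU with a load of 300 and the other with 60, while the greedy assignment gives both a load of 180. A real system would also have to account for constraints this sketch ignores, such as measured kernel times and any restrictions on which heads can be co-located.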
Taxonomy
Topics: Advanced Neural Network Applications · Big Data and Digital Economy · Machine Learning in Healthcare
