SparseBalance: Load-Balanced Long Context Training with Dynamic Sparse Attention

Hongtao Xu; Jianchao Tan; Yuxuan Hu; Pengju Lu; Hongyu Wang; Pingwei Sun; Yerui Sun; Yuchen Xie; Xunliang Cai; Mingzhen Li; Weile Jia

arXiv:2604.13847·cs.LG·April 27, 2026

TL;DR

SparseBalance is a co-designed algorithm-system framework that enhances long-context training efficiency and accuracy by dynamically balancing sparsity and sequence heterogeneity in sparse attention models.

Contribution

It introduces workload-aware dynamic sparsity tuning and a sparsity-aware batching strategy to jointly optimize model accuracy and system efficiency.

Findings

01. Achieves up to 1.33× speedup in training.

02. Improves long-context capability by 0.46% on LongBench.

03. Effectively balances heterogeneity issues in sparse attention training.

Abstract

While sparse attention mitigates the computational bottleneck of long-context LLM training, its distributed training process exhibits extreme heterogeneity in both (1) sequence length and (2) sparsity sensitivity, leading to a severe load imbalance and sub-optimal model accuracy. Existing algorithms and training frameworks typically focus on a single issue and fail to co-optimize the two systematically. We therefore propose SparseBalance, a novel algorithm-system co-design framework that exploits sparsity and sequence heterogeneity to jointly optimize model accuracy and system efficiency. First, we propose workload-aware dynamic sparsity tuning, which employs a bidirectional sparsity adjustment to eliminate stragglers and exploit inherent bubbles for free accuracy. Second, we propose a sparsity-aware batching strategy to achieve coarse-grained balance,…
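The idea of a bidirectional sparsity adjustment can be illustrated with a minimal sketch. This is not the paper's algorithm: the quadratic cost model, the notion of a per-rank "keep ratio" (the fraction of attention blocks retained), the function names, and the clamp bounds are all assumptions made for illustration. Ranks whose estimated workload exceeds the average attend more sparsely, and ranks that would otherwise idle attend more densely.

```python
from typing import List


def dense_cost(seq_lens: List[int]) -> float:
    # Assumed cost model: dense attention work grows with the square
    # of each sequence length on a rank.
    return float(sum(n * n for n in seq_lens))


def tune_keep_ratios(
    rank_seq_lens: List[List[int]],
    base_keep: float = 0.25,   # hypothetical default keep ratio
    min_keep: float = 0.125,   # hypothetical lower bound (most sparse)
    max_keep: float = 0.5,     # hypothetical upper bound (least sparse)
) -> List[float]:
    """Return one keep ratio per rank, nudged toward the mean workload."""
    dense = [dense_cost(lens) for lens in rank_seq_lens]
    # Target is the average sparse workload at the base keep ratio.
    target = base_keep * sum(dense) / len(dense)
    ratios = []
    for d in dense:
        if d == 0.0:
            ratios.append(base_keep)  # empty rank: nothing to tune
            continue
        # Heavy ranks get a ratio below base_keep (sparser), light ranks
        # a ratio above it (denser); both are clamped to the allowed range.
        ratios.append(min(max_keep, max(min_keep, target / d)))
    return ratios


# Three ranks holding very different sequence mixes.
ranks = [[8192], [4096, 4096], [2048]]
ratios = tune_keep_ratios(ranks)
costs_before = [0.25 * dense_cost(r) for r in ranks]
costs_after = [k * dense_cost(r) for k, r in zip(ratios, ranks)]
```

In this toy setting the rank holding the 8192-token sequence becomes sparser, the rank holding the 2048-token sequence becomes denser up to the upper bound, and the slowest rank's estimated cost drops. A real system would also need a per-sequence sensitivity signal to decide where extra sparsity is safe, which this sketch does not model.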
