Mitigating Staleness in Asynchronous Pipeline Parallelism via Basis Rotation
Hyunji Jung, Sungbin Shin, Namhoon Lee

TL;DR
This paper introduces basis rotation to mitigate gradient staleness in asynchronous pipeline parallelism, significantly improving convergence speed and scalability in large-scale distributed training.
Contribution
The paper proposes basis rotation, a method for rectifying delayed gradients, to restore the scalability and efficiency of asynchronous pipeline-parallel training.
Findings
Achieves 76.8% fewer iterations in training a 1B-parameter LLM.
Restores effective curvature-aware optimization with basis rotation.
Theoretically and empirically demonstrates mitigation of gradient staleness effects.
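As a rough illustration of why gradient staleness matters, the sketch below runs plain gradient descent on a toy scalar quadratic where each update uses a gradient computed on weights from `delay` steps earlier. This is not the paper's method or experimental setup: the objective, the step size, the tolerance, and the delay values are all assumptions chosen for illustration, and the function name `steps_to_converge` is hypothetical. It does not implement basis rotation.

```python
from collections import deque


def steps_to_converge(delay, lr=0.3, tol=1e-6, max_steps=10_000):
    """Minimize f(w) = 0.5 * w**2 with gradients that are `delay` steps stale.

    Returns the number of updates until the current iterate and all buffered
    past iterates are within `tol` of the optimum (w = 0), or None if the
    run diverges or does not converge within `max_steps`.
    """
    # Buffer of the last (delay + 1) iterates, oldest first. Toy assumption:
    # every iterate before the first update equals the initial weight 1.0.
    history = deque([1.0] * (delay + 1), maxlen=delay + 1)
    for step in range(1, max_steps + 1):
        stale_w = history[0]   # weights the delayed gradient was computed on
        grad = stale_w         # f'(w) = w for this quadratic
        # Update the newest weights with the stale gradient; appending
        # drops the oldest entry because of maxlen.
        history.append(history[-1] - lr * grad)
        size = max(abs(w) for w in history)
        if size < tol:
            return step
        if size > 1e6:         # treat runaway growth as divergence
            return None
    return None


if __name__ == "__main__":
    for delay in (0, 3, 7):    # hypothetical delays, in optimizer steps
        print(delay, steps_to_converge(delay))
```

With this fixed step size, the run with no delay converges quickly, a delay of 3 takes several times as many updates, and a delay of 7 diverges. In this toy setting the largest stable step size shrinks as the delay grows, which is one way to see why a delay that grows with pipeline depth undermines scalability.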
Abstract
Asynchronous pipeline parallelism maximizes hardware utilization by eliminating the pipeline bubbles inherent in synchronous execution, offering a path toward efficient large-scale distributed training. However, this efficiency gain can be compromised by gradient staleness, where the immediate model updates with delayed gradients introduce noise into the optimization process. Crucially, we identify a critical, yet often overlooked, pathology: this delay scales linearly with pipeline depth, fundamentally undermining the very scalability that the method originally intends to provide. In this work, we investigate this inconsistency and bridge the gap by rectifying delayed gradients through basis rotation, restoring scalable asynchronous training while maintaining performance. Specifically, we observe that the deleterious effects of delayed gradients are exacerbated when the Hessian…
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Low-power high-performance VLSI design · Advanced Memory and Neural Computing
