Nesterov Method for Asynchronous Pipeline Parallel Optimization

Thalaiyasingam Ajanthan, Sameera Ramasinghe, Yan Zuo, Gil Avraham, and Alexander Long

arXiv:2505.01099 · cs.LG · May 5, 2025

TL;DR

This paper introduces a variant of Nesterov Accelerated Gradient (NAG) tailored to asynchronous pipeline-parallel training of large neural networks. The variant addresses gradient staleness and is reported to outperform existing asynchronous methods on large-scale language modeling.

Contribution

The paper proposes a NAG-based method for asynchronous pipeline parallelism that modifies the look-ahead step to mitigate gradient staleness, and it proves sublinear convergence under a fixed gradient delay.

Findings

01 Outperforms existing asynchronous methods in large-scale language modeling

02 Converges at a sublinear rate under a fixed gradient delay

03 Surpasses the synchronous training baseline

Abstract

Pipeline Parallelism (PP) enables large neural network training on small, interconnected devices by splitting the model into multiple stages. To maximize pipeline utilization, asynchronous optimization is appealing as it offers 100% pipeline utilization by construction. However, it is inherently challenging as the weights and gradients are no longer synchronized, leading to stale (or delayed) gradients. To alleviate this, we introduce a variant of Nesterov Accelerated Gradient (NAG) for asynchronous optimization in PP. Specifically, we modify the look-ahead step in NAG to effectively address the staleness in gradients. We theoretically prove that our approach converges at a sublinear rate in the presence of fixed delay in gradients. Our experiments on large-scale language modelling tasks using decoder-only architectures with up to 1B parameters demonstrate that our approach…
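To make the staleness problem concrete, the following is a minimal sketch of textbook NAG in which each gradient is applied a fixed number of steps after the look-ahead point it was computed at. It is not the paper's method: the abstract does not spell out how the look-ahead step is modified, so that modification is not reproduced here. The function name `nag_with_fixed_delay`, its parameters, and the scalar toy objective are all illustrative assumptions.

```python
from collections import deque


def nag_with_fixed_delay(grad, w0, lr=0.1, momentum=0.9, delay=0, steps=500):
    """Standard NAG on a scalar parameter, with gradients arriving `delay` steps late.

    Hypothetical helper for illustration only; it does NOT implement the
    modified look-ahead proposed in the paper.
    """
    w, v = w0, 0.0
    # Look-ahead points whose gradients are still "in flight".
    pending = deque()
    for _ in range(steps):
        # Textbook NAG look-ahead: evaluate the gradient ahead of the current weights.
        pending.append(w + momentum * v)
        if len(pending) > delay:
            # The gradient applied now was computed at a point `delay` steps old.
            stale_point = pending.popleft()
            v = momentum * v - lr * grad(stale_point)
            w = w + v
        # Otherwise no gradient has arrived yet (warm-up), so the weights stay put.
    return w


if __name__ == "__main__":
    # Toy objective f(w) = 0.5 * w**2, whose gradient is simply w.
    def quad_grad(w):
        return w

    for d in (0, 1, 4):
        w_final = nag_with_fixed_delay(quad_grad, w0=1.0, delay=d)
        print(f"delay={d}: |w| after 500 steps = {abs(w_final):.3e}")
```

With `delay=0` this reduces to ordinary NAG. Raising `delay` shows how the same step size and momentum behave once the gradient no longer matches the current weights, which is the mismatch the paper's modified look-ahead is designed to address.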

Code & Models

Repositories

pluralisresearch/asyncpp (PyTorch, official)

Videos

Nesterov Method for Asynchronous Pipeline Parallel Optimization · SlidesLive

Taxonomy

Topics: Power Systems and Technologies · VLSI and FPGA Design Techniques · Experimental Learning in Engineering

Methods: Nesterov Accelerated Gradient