AsyncMesh: Fully Asynchronous Optimization for Data and Pipeline Parallelism
Thalaiyasingam Ajanthan, Sameera Ramasinghe, Gil Avraham, Hadi Mohaghegh Dolatabadi, Chamin P Hewa Koneputugodage, Violetta Shevchenko, Yan Zuo, Alexander Long

TL;DR
AsyncMesh introduces fully asynchronous optimization techniques for data and pipeline parallelism in neural network training, reducing communication costs and maintaining performance on large-scale models.
Contribution
It proposes a novel asynchronous update framework with staleness mitigation strategies and provides convergence guarantees, enabling scalable training without co-location constraints.
Findings
Matches synchronous baseline performance on large models
Significantly reduces communication overhead
Provides convergence guarantees for asynchronous methods
Abstract
Data and pipeline parallelism are key strategies for scaling neural network training across distributed devices, but their high communication cost necessitates co-located computing clusters with fast interconnects, limiting their scalability. We address this communication bottleneck by introducing asynchronous updates across both parallelism axes, relaxing the co-location requirement at the expense of introducing staleness between pipeline stages and data parallel replicas. To mitigate staleness, for pipeline parallelism, we adopt a weight look-ahead approach, and for data parallelism, we introduce an asynchronous sparse averaging method equipped with an exponential moving average based correction mechanism. We provide convergence guarantees for both sparse averaging and asynchronous updates. Experiments on large-scale language models (up to 1B parameters) demonstrate that our…
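To make the idea of sparse averaging concrete, here is a minimal sketch, not the authors' implementation. It assumes each data parallel replica holds a flat parameter vector and that, in each round, only a random subset of coordinates is averaged across replicas. The function name `sparse_average` and the `fraction` parameter are hypothetical. The sketch runs synchronously and leaves out the asynchronous communication and the EMA-based correction described in the abstract.

```python
import random


def sparse_average(replicas, fraction, rng):
    """Average a random subset of coordinates across replicas, in place.

    replicas: list of equal-length lists of floats, one vector per replica.
    fraction: share of coordinates to synchronize per round (hypothetical knob).
    rng:      a random.Random instance, passed in for reproducibility.
    Returns the sorted indices that were averaged in this round.
    """
    dim = len(replicas[0])
    k = max(1, int(fraction * dim))          # communicate at least one coordinate
    idx = sorted(rng.sample(range(dim), k))  # subset shared by all replicas
    for i in idx:
        mean_i = sum(r[i] for r in replicas) / len(replicas)
        for r in replicas:
            r[i] = mean_i                    # selected coordinates now agree
    return idx


# Usage: 4 replicas with 8 parameters each, synchronizing 25% per round.
rng = random.Random(0)
replicas = [[float(j + 10 * r) for j in range(8)] for r in range(4)]
synced = sparse_average(replicas, fraction=0.25, rng=rng)
```

Only the selected coordinates are communicated, so the per-round traffic scales with `fraction` rather than with the full model size. Coordinates that are not selected keep drifting apart between rounds, which is the kind of staleness the paper's correction mechanism is meant to address.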
Peer Reviews
Decision: Submitted to ICLR 2026
1. AsyncMesh explores the setup where both DP and PP are asynchronous. 2. The paper designs an Exponential Moving Average (EMA) based correction mechanism that approximates the average staleness. 3. The paper provides theoretical justification of convergence in the presence of staleness in a homogeneous setup where only a small subset of weights is communicated between DP replicas.
1. The baseline for the evaluation is weak. The evaluation only compares AsyncMesh with FullyAsync and DP. However, well-studied staleness-aware LLM training [1] (with different degrees of staleness) and block coordinate descent with correction [2] were not included in the evaluation. 2. The evaluation results do not show how much performance improvement sparse averaging brings. [1] PipeDream: Generalized pipeline parallelism for DNN training
1. Theoretical analysis of AsyncPP and SPARTA in DP 2. End-to-end training experiments with loss curves
1. The main experimental model is a toy size of 160M, which cannot represent real-world pre-training model patterns. In addition, it is a toy NanoGPT, not a real GPT model. Furthermore, the model does not even have a basic dropout layer, which makes the loss curve comparison less convincing. 2. The paper's contribution is very minor: it combines existing work, AsyncPP and SPARTA in DP, and does a bit of tuning. There is very little research novelty here. 3. Whether the model can converge
+ Introduces AsyncMesh, which enables asynchronous updates across both data parallelism (DP) and pipeline parallelism (PP) to address the communication bottleneck. + Combines Nesterov-based weight look-ahead for PP and Exponential Moving Average (EMA) correction for DP to counteract stale gradients and parameters effectively. + Provides formal convergence guarantees for both asynchronous sparse averaging and delayed updates, extending existing results from stochastic approximation theory. + Demo
- Theoretical convergence guarantees rely on homogeneous settings, which may not hold in practical heterogeneous or real-world decentralized systems. - Although sparse averaging reduces communication, it could slow convergence for extremely small subsets or large delays, as hinted in the theoretical analysis. No experiments have been done on this. - The paper lacks direct comparisons with strong recent baselines such as DeepSpeed ZeRO and ZeRO++. - The effects of EMA decay rates, subset sizes, and delay
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Parallel Computing and Optimization Techniques · Advanced Neural Network Applications
