Efficient AllReduce with Stragglers
Arjun Devraj, Eric Ding, Abhishek Vijaya Kumar, Robert Kleinberg, Rachee Singh

TL;DR
StragglAR is a novel AllReduce algorithm that exploits GPU execution time variation to reduce synchronization delays, achieving significant speedups in distributed machine learning workloads.
Contribution
It introduces a new parallel AllReduce algorithm that uses natural asymmetry in GPU execution times to mitigate straggler effects, outperforming popular bandwidth-efficient AllReduce algorithms.
Findings
Achieves a 2x theoretical speedup over popular bandwidth-efficient algorithms for large GPU clusters.
Provides 25% speedup on an 8-GPU server compared to state-of-the-art methods.
Surpasses the lower bound for bandwidth-optimal AllReduce by exploiting asymmetry.
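To put the 2x figure in context, the sketch below computes the bandwidth term of ring AllReduce under the standard alpha-beta cost model, where it equals 2(n-1)/n · S/B for a buffer of S bytes and per-GPU bandwidth B. It assumes ring AllReduce is a representative bandwidth-efficient baseline; the function name and parameters are illustrative and do not come from the paper.

```python
def ring_allreduce_bandwidth_time(n_gpus: int, size_bytes: float,
                                  bandwidth_bytes_per_s: float) -> float:
    """Bandwidth term of ring AllReduce in the alpha-beta model.

    Ring AllReduce is a ReduceScatter followed by an AllGather. Each phase
    moves (n - 1) / n of the buffer per GPU, giving 2 * (n - 1) / n * S / B.
    Latency (alpha) terms are ignored in this sketch.
    """
    if n_gpus < 2:
        return 0.0  # a single GPU has nothing to exchange
    return 2 * (n_gpus - 1) / n_gpus * size_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    size = 1e9        # 1 GB buffer (illustrative value)
    bandwidth = 1e9   # 1 GB/s per GPU (illustrative value)
    for n in (8, 64, 1024):
        t = ring_allreduce_bandwidth_time(n, size, bandwidth)
        print(f"n={n:5d}  ring AllReduce bandwidth time = {t:.4f} s")
```

As n grows, the factor 2(n-1)/n approaches 2, so under this assumed baseline a 2x speedup corresponds to a bandwidth term close to S/B.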
Abstract
Distributed machine learning workloads use data and tensor parallelism for training and inference, both of which rely on the AllReduce collective to synchronize gradients or activations. However, AllReduce algorithms are delayed by the slowest GPU to reach the synchronization barrier before the collective (i.e., the straggler). To address this challenge, we propose StragglAR: a parallel algorithm for AllReduce that accelerates distributed training and inference by exploiting natural variation in GPU execution times. StragglAR implements a ReduceScatter among the remaining GPUs during the straggler-induced delay, and then executes a novel collective algorithm to complete the AllReduce once the final GPU reaches the synchronization barrier. StragglAR achieves a 2x theoretical speedup over popular bandwidth-efficient algorithms for large GPU clusters, surpassing the lower bound for…
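The two-phase structure described in the abstract can be illustrated with a small, purely functional simulation. This is a minimal sketch under stated assumptions, not the paper's algorithm: phase 1 performs a ReduceScatter among the GPUs that arrive early, and phase 2 uses a naive completion step (the straggler's values are added to each shard, then shards are gathered) as a stand-in for the novel collective, whose schedule is not given in this summary. All function names are hypothetical, and each buffer holds exactly one element per early GPU for simplicity.

```python
from typing import List


def reduce_scatter(buffers: List[List[float]]) -> List[float]:
    """Return the shards after a ReduceScatter: entry i is the sum of
    element i over all participating ranks, owned by rank i."""
    n = len(buffers)
    return [sum(buf[i] for buf in buffers) for i in range(n)]


def allreduce_with_straggler(buffers: List[List[float]]) -> List[List[float]]:
    """Simulate AllReduce where the last rank is the straggler.

    Assumption: every buffer has len(buffers) - 1 elements, one per early rank.
    """
    *early, straggler = buffers
    m = len(early)
    assert all(len(b) == m for b in buffers), "one element per early rank"

    # Phase 1: runs while the straggler is still computing.
    partial = reduce_scatter(early)

    # Phase 2 (naive stand-in, NOT the paper's collective): the straggler
    # contributes element i to the owner of shard i.
    full = [partial[i] + straggler[i] for i in range(m)]

    # Final AllGather: every rank, including the straggler, gets all shards.
    return [list(full) for _ in buffers]


if __name__ == "__main__":
    data = [[1, 2, 3], [10, 20, 30], [100, 200, 300], [1000, 2000, 3000]]
    print(allreduce_with_straggler(data)[0])  # [1111, 2222, 3333]
```

The simulation only checks that the result matches a plain element-wise sum; it says nothing about the communication schedule or cost of the completion step.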
Peer Reviews
Decision: Submitted to ICLR 2026
- Paper is about an important problem in enabling large-scale distributed training
- In the limited scenarios the algorithm is evaluated, it lowers the upper bound of allreduce computation
- It is a distributed-systems-focused paper, so an ML conference might not be a good fit in terms of its topic and audience. MLSys, ASPLOS or HPC conferences would be a better fit for this topic.
- The algorithm and upper-bound calculations assume there is only one straggler. But in real scenarios the delays exhibit a distribution, rather than a binary categorization of straggler vs. non-straggler (as also illustrated in Figure 2 in the paper). Furthermore, artificial straggler delays are used
**[S1]** Stragglers are causing real problems in datacenters, and this work could save a lot of monetary cost, which the paper directly presents.
**[S2]** This paper presents a solid algorithm with adequate proofs. I don't see any problems in the algorithm itself.
**[S3]** Quantitative bandwidth analysis, telling the readers how much benefit can be gained in arbitrary environments.
**[W1]** Comparison with potential existing algorithms would be beneficial. For example, the authors mention that there exists a tree-based algorithm (AdapCC). I did not check AdapCC, but a layman who knows a tree-based AR algorithm could design an AR/RS tree such that the straggler participates as late as possible. I believe it can be a simpler alternative. Would it be possible to devise a performance model of this to compare?
**[W2]** The experiments are weak in scale and settings. - The expe
1. Well-motivated: This paper works on a practical problem prevalent in distributed training with stragglers. It has strong motivation with empirical and literature support for straggler prevalence. It has an innovative low-level AR redesign.
2. Solid Work: Solid algorithm design with clear exposition. The experiments on a small-scale environment provide preliminary verification of the proposed method's effectiveness under varying straggler delays. The end-to-end speedup is remarkable.
1. Unfair Complexity Analysis: The complexity analysis is not general, as it only considers the ideal case where RS execution is hidden by straggler latency. This makes the comparison with other AR algorithms unfair.
2. Missing Baseline: The paper omits a comparison with a critical and more recent baseline, MSCCL++ (https://arxiv.org/pdf/2504.09014), making it difficult to assess its incremental contribution.
3. Figure 7 shows that the straggler delay CDF varies by environment, yet Figures 5(b)
Taxonomy
Topics: Advanced Bandit Algorithms Research
