Adaptive Consensus Gradients Aggregation for Scaled Distributed Training
Yoni Choukroun, Shlomi Azoulay, Pavel Kisilev

TL;DR
This paper proposes an adaptive gradient aggregation method for distributed deep learning. Instead of simple averaging, it weights worker gradients using coefficients derived from subspace optimization and adds subspace momentum to accelerate convergence, and it reports better performance than averaging.
Contribution
The paper introduces a subspace optimization framework for gradient aggregation in distributed training. The framework yields an adaptive weighting scheme for worker gradients and a subspace momentum term.
Findings
Outperforms standard gradient averaging on MLPerf benchmarks
Reduces communication and computational costs
Accelerates convergence with subspace momentum
Abstract
Distributed machine learning has recently become a critical paradigm for training large models on vast datasets. We examine the stochastic optimization problem for deep learning within synchronous parallel computing environments under communication constraints. While averaging distributed gradients is the most widely used method for gradient estimation, whether this is the optimal strategy remains an open question. In this work, we analyze the distributed gradient aggregation process through the lens of subspace optimization. By formulating the aggregation problem as an objective-aware subspace optimization problem, we derive an efficient weighting scheme for gradients, guided by subspace coefficients. We further introduce subspace momentum to accelerate convergence while maintaining statistical unbiasedness in the aggregation. Our method demonstrates improved performance over the…
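The contrast between plain averaging and a weighted combination of worker gradients can be sketched in a few lines. This is only an illustration of the general idea, not the paper's derivation: the abstract here is truncated and does not give the weighting formula, so the `consensus_weights` rule below (score each gradient by its non-negative inner product with the sum of the other workers' gradients) is a hypothetical stand-in, and all function names are assumptions. Subspace momentum is not modeled.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def average_weights(n: int) -> List[float]:
    """Baseline: every worker gets the same coefficient 1/n."""
    return [1.0 / n] * n


def consensus_weights(grads: List[Vector]) -> List[float]:
    """Hypothetical weighting rule (an assumption, not the paper's formula).

    Each gradient is scored by its agreement with the sum of the other
    workers' gradients; negative scores are clipped to zero and the
    scores are normalized to sum to one.
    """
    n = len(grads)
    dim = len(grads[0])
    total = [sum(g[k] for g in grads) for k in range(dim)]
    scores = []
    for g in grads:
        others = [t - x for t, x in zip(total, g)]
        scores.append(max(dot(g, others), 0.0))
    norm = sum(scores)
    if norm == 0.0:
        # No agreement signal at all: fall back to plain averaging.
        return average_weights(n)
    return [s / norm for s in scores]


def aggregate(grads: List[Vector], weights: List[float]) -> Vector:
    """Return the combination sum_i weights[i] * grads[i]."""
    dim = len(grads[0])
    return [sum(w * g[k] for w, g in zip(weights, grads)) for k in range(dim)]


if __name__ == "__main__":
    # Three workers; the third disagrees with the other two.
    grads = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print("average  :", aggregate(grads, average_weights(len(grads))))
    print("weighted :", aggregate(grads, consensus_weights(grads)))
```

In this toy case the two agreeing workers each receive coefficient 0.5 and the third receives 0, so the weighted direction follows the agreeing pair, whereas averaging mixes in the third gradient. In both cases the update stays inside the span of the worker gradients; only the coefficients differ.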
Taxonomy
Topics: Stochastic Gradient Optimization Techniques
