OptiReduce: Resilient and Tail-Optimal AllReduce for Distributed Deep Learning in the Cloud
Ertza Warraich, Omer Shabtai, Khalid Manaa, Shay Vargaftik, Yonatan Piasetzky, Matty Kadosh, Lalith Suresh, Muhammad Shahbaz

TL;DR
OptiReduce is a resilient collective-communication system for distributed deep learning in the cloud that reduces tail latency and maintains accuracy despite variability and gradient drops.
Contribution
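OptiReduce exploits the inherent resiliency of distributed deep learning (DDL) to approximated or lost gradients. It introduces mechanisms (e.g., unreliable bounded transport with adaptive timeout) that improve tail execution time, and strategies (e.g., Transpose AllReduce and Hadamard Transform) that mitigate the impact of gradient drops on model accuracy.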
Findings
Achieves 70% faster time-to-accuracy than Gloo.
Achieves 30% faster time-to-accuracy than NCCL.
Effectively mitigates the impact of gradient drops on model accuracy.
Abstract
We present OptiReduce, a new collective-communication system for the cloud with bounded, predictable completion times for deep-learning jobs in the presence of varying computation (stragglers) and communication (congestion and gradient drops) variabilities. OptiReduce exploits the inherent resiliency and the stochastic nature of distributed deep-learning (DDL) training and fine-tuning to work with approximated (or lost) gradients -- providing an efficient balance between (tail) performance and the resulting accuracy of the trained models. Exploiting this domain-specific characteristic of DDL, OptiReduce introduces (1) mechanisms (e.g., unreliable bounded transport with adaptive timeout) to improve the DDL jobs' tail execution time, and (2) strategies (e.g., Transpose AllReduce and Hadamard Transform) to mitigate the impact of gradient drops on model accuracy. Our evaluation shows that…
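To illustrate why a Hadamard Transform can soften the effect of gradient drops, here is a minimal sketch and not the OptiReduce implementation. It assumes a plain orthonormal Walsh-Hadamard transform with no randomization, and it models a drop as zeroing a contiguous block of entries. The helper names `fwht`, `drop_block`, and `max_abs_error` are hypothetical and chosen only for this example. The idea is that when a gradient is encoded before transmission, a lost block removes a little information from every coordinate instead of wiping out a few coordinates entirely.

```python
import math


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (self-inverse).

    The input length must be a power of two.
    """
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        # Butterfly step: combine pairs that are h positions apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def drop_block(vec, start, count):
    """Model a gradient drop (an assumption): zero `count` entries from `start`."""
    out = list(vec)
    for i in range(start, min(start + count, len(out))):
        out[i] = 0.0
    return out


def max_abs_error(reference, received):
    """Largest per-coordinate deviation from the reference vector."""
    return max(abs(r - x) for r, x in zip(reference, received))


if __name__ == "__main__":
    # A toy gradient with one dominant coordinate inside the block that gets lost.
    grad = [8.0] + [0.0] * 63

    # Without encoding, the dominant coordinate disappears completely.
    direct = drop_block(grad, 0, 8)

    # With encoding: transform, lose the same block, then invert the transform.
    decoded = fwht(drop_block(fwht(grad), 0, 8))

    print("max error without transform:", max_abs_error(grad, direct))
    print("max error with transform:   ", max_abs_error(grad, decoded))
```

In this toy case the same eight-entry loss costs the full value of the dominant coordinate when sent directly, while the encoded version spreads a much smaller error across several coordinates. Real gradients and real drop patterns will behave differently, so treat the numbers as illustrative only.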
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Privacy-Preserving Technologies in Data · Brain Tumor Detection and Classification
