Robust Fully-Asynchronous Methods for Distributed Training over General Architecture

Zehan Zhu; Ye Tian; Yan Huang; Jinming Xu; Shibo He

arXiv:2307.11617 · cs.DC · July 30, 2024

Open Access

TL;DR

This paper introduces R-FAST, a robust asynchronous distributed training method that tolerates data heterogeneity and packet loss and supports flexible communication architectures. It converges faster than synchronous methods while reaching comparable accuracy.

Contribution

The paper proposes R-FAST, an asynchronous distributed training algorithm that combines a robust gradient tracking strategy with flexible communication over two spanning-tree graphs, improving efficiency and resilience to packet loss over existing methods.
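
For readers unfamiliar with gradient tracking, the following is a minimal sketch of the standard synchronous, deterministic version of the technique, not R-FAST itself: it has no asynchrony, no packet loss, and none of the auxiliary buffering variables the paper describes. The quadratic local losses, the ring mixing matrix, the step size, and every name in the code are assumptions chosen for illustration. It shows how tracking the average gradient lets every device reach the global minimizer even though each device holds different data.

```python
# Illustrative sketch only: plain synchronous gradient tracking, NOT R-FAST.
# Each device i holds the (assumed) local loss f_i(x) = 0.5 * (x - targets[i])**2,
# so the global minimizer is the mean of the targets.

def gradient_tracking(targets, weights, step=0.1, iters=300):
    n = len(targets)

    def grad(i, x):
        # Gradient of the assumed local quadratic loss of device i.
        return x - targets[i]

    x = [0.0] * n                           # local model estimates
    y = [grad(i, x[i]) for i in range(n)]   # trackers of the average gradient

    for _ in range(iters):
        # Mix with neighbors, then step along the tracked gradient.
        x_new = [
            sum(weights[i][j] * x[j] for j in range(n)) - step * y[i]
            for i in range(n)
        ]
        # Mix the trackers and add the change in the local gradient.
        y = [
            sum(weights[i][j] * y[j] for j in range(n))
            + grad(i, x_new[i]) - grad(i, x[i])
            for i in range(n)
        ]
        x = x_new
    return x


# Hypothetical setup: 4 devices on a ring with a doubly stochastic mixing matrix.
ring_weights = [
    [0.50, 0.25, 0.00, 0.25],
    [0.25, 0.50, 0.25, 0.00],
    [0.00, 0.25, 0.50, 0.25],
    [0.25, 0.00, 0.25, 0.50],
]
targets = [1.0, 2.0, 3.0, 6.0]  # heterogeneous local data, mean is 3.0
estimates = gradient_tracking(targets, ring_weights)
print(estimates)
```

In this toy setting all four estimates approach the mean of the targets, whereas plain decentralized gradient descent with a constant step size would generally leave a bias that depends on how different the local losses are.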

Findings

01. R-FAST converges to a neighborhood of the optimum at a geometric rate for convex problems.

02. R-FAST converges to a stationary point at a sublinear rate for non-convex problems.

03. R-FAST is 1.5-2 times faster than synchronous benchmarks and outperforms existing asynchronous algorithms.

Abstract

Perfect synchronization in distributed machine learning problems is inefficient and even impossible due to the existence of latency, packet losses and stragglers. We propose a Robust Fully-Asynchronous Stochastic Gradient Tracking method (R-FAST), where each device performs local computation and communication at its own pace without any form of synchronization. Unlike existing asynchronous distributed algorithms, R-FAST can eliminate the impact of data heterogeneity across devices and tolerate packet losses by employing a robust gradient tracking strategy that relies on properly designed auxiliary variables for tracking and buffering the overall gradient vector. More importantly, the proposed method utilizes two spanning-tree graphs for communication as long as both share at least one common root, enabling flexible designs in communication architectures. We show that R-FAST…
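
The common-root condition on the two communication graphs can be checked mechanically. The sketch below is an illustration under assumptions, not code from the paper: it treats a node as a root of a directed graph if every other node is reachable from it along edges written as (source, destination), and it reports the roots that two graphs share. The paper may define roots on a transposed graph or use a different edge convention, and the function names and example graphs here are hypothetical.

```python
from collections import deque


def roots(num_nodes, edges):
    """Return the nodes from which every node is reachable along directed edges."""
    adjacency = {node: [] for node in range(num_nodes)}
    for src, dst in edges:
        adjacency[src].append(dst)

    found = set()
    for start in range(num_nodes):
        # Breadth-first search from the candidate root.
        seen = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adjacency[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        if len(seen) == num_nodes:
            found.add(start)
    return found


def common_roots(num_nodes, edges_a, edges_b):
    """Roots shared by both graphs; empty means the condition fails."""
    return roots(num_nodes, edges_a) & roots(num_nodes, edges_b)


# Hypothetical 4-node examples.
graph_a = [(0, 1), (1, 2), (2, 3)]   # chain rooted at node 0
graph_b = [(0, 3), (3, 2), (0, 1)]   # different tree, also rooted at node 0
graph_c = [(3, 2), (2, 1), (1, 0)]   # chain rooted at node 3

shared_ab = common_roots(4, graph_a, graph_b)
shared_ac = common_roots(4, graph_a, graph_c)
print(shared_ab, shared_ac)
```

Under these assumptions, `graph_a` and `graph_b` share node 0 as a root even though their edges differ, while `graph_a` and `graph_c` share none, which is the kind of pairing the condition would rule out.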

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Privacy-Preserving Technologies in Data · Machine Learning and ELM