Distributed Learning over Unreliable Networks

Chen Yu; Hanlin Tang; Cedric Renggli; Simon Kassing; Ankit Singla; Dan Alistarh; Ce Zhang; Ji Liu

arXiv:1810.07766 · cs.DC · May 17, 2019 · 21 cites

TL;DR

This paper investigates whether distributed machine learning systems can be designed to remain effective under unreliable network conditions. It provides theoretical guarantees and validates them through simulations.

Contribution

The paper gives a theoretical analysis showing that distributed learning algorithms converge over unreliable networks, and that the impact of packet drops diminishes as the number of parameter servers grows.

Findings

01. Distributed learning can converge despite network unreliability.

02. The impact of packet drops decreases as the number of parameter servers increases.

03. Simulations confirm system robustness over unreliable network layers.

Abstract

Most of today's distributed machine learning systems assume reliable networks: whenever two machines exchange information (e.g., gradients or models), the network should guarantee the delivery of the message. At the same time, recent work exhibits the impressive tolerance of machine learning algorithms to errors or noise arising from relaxed communication or synchronization. In this paper, we connect these two trends, and consider the following question: Can we design machine learning systems that are tolerant to network unreliability during training? With this motivation, we focus on a theoretical problem of independent interest: given a standard distributed parameter server architecture, if every communication between the worker and the server has a non-zero probability $p$ of being dropped, does there exist an algorithm that still converges, and at what speed? The…
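To make the problem setup concrete, the following is a minimal sketch of one averaging round in which every worker-to-server and server-to-worker message is dropped independently with probability p. It is an illustration under stated assumptions, not the algorithm analyzed in the paper: the function name `lossy_average`, the choice of one server per worker, the round-robin assignment of coordinates to servers, and the rule that a worker keeps its local values when a reply is lost are all hypothetical.

```python
import random


def lossy_average(local_vectors, drop_prob, rng):
    """One averaging round over an unreliable network (illustrative only).

    Assumptions (not taken from the paper): there is one server per worker,
    server s owns coordinates s, s + n, s + 2n, ..., a server always has its
    own worker's block, and a worker keeps its local block on a lost reply.
    """
    n_workers = len(local_vectors)
    dim = len(local_vectors[0])
    result = [list(v) for v in local_vectors]

    for server in range(n_workers):
        coords = range(server, dim, n_workers)

        # Upload: each worker's block reaches the server with prob. 1 - p.
        senders = [
            w for w in range(n_workers)
            if w == server or rng.random() >= drop_prob
        ]
        # The server averages only the blocks that actually arrived.
        block_mean = {
            c: sum(local_vectors[w][c] for w in senders) / len(senders)
            for c in coords
        }

        # Download: the averaged block reaches each worker with prob. 1 - p.
        for w in range(n_workers):
            if w == server or rng.random() >= drop_prob:
                for c in coords:
                    result[w][c] = block_mean[c]

    return result


if __name__ == "__main__":
    rng = random.Random(0)
    vectors = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]
    after = lossy_average(vectors, drop_prob=0.1, rng=rng)
    print(after[0])
```

With `drop_prob=0.0` every worker ends the round holding the exact average; with `drop_prob=1.0` no message is delivered and each worker keeps its own vector. Values in between leave the workers with slightly different models, which is the kind of inconsistency a convergence analysis for this setting has to account for.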

Taxonomy

Topics: Age of Information Optimization · Stochastic Gradient Optimization Techniques · Distributed Sensor Networks and Detection Algorithms