Distributed Learning over Unreliable Networks
Chen Yu, Hanlin Tang, Cedric Renggli, Simon Kassing, Ankit Singla, Dan Alistarh, Ce Zhang, Ji Liu

TL;DR
This paper investigates whether distributed machine learning systems can be designed to remain effective under unreliable network conditions, providing theoretical guarantees and validating them through simulations.
Contribution
It introduces a theoretical analysis demonstrating that distributed learning algorithms converge over unreliable networks, and shows that the impact of packet drops diminishes as the number of parameter servers grows.
Findings
Distributed learning can converge despite network unreliability.
The impact of packet drops decreases as the number of parameter servers increases.
Simulations confirm system robustness over unreliable network layers.
Abstract
Most of today's distributed machine learning systems assume reliable networks: whenever two machines exchange information (e.g., gradients or models), the network should guarantee the delivery of the message. At the same time, recent work exhibits the impressive tolerance of machine learning algorithms to errors or noise arising from relaxed communication or synchronization. In this paper, we connect these two trends, and consider the following question: Can we design machine learning systems that are tolerant to network unreliability during training? With this motivation, we focus on a theoretical problem of independent interest: given a standard distributed parameter server architecture, if every communication between the worker and the server has a non-zero probability of being dropped, does there exist an algorithm that still converges, and at what speed? The…
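The setting described in the abstract can be illustrated with a toy simulation. The sketch below is not the paper's algorithm or experimental setup; it is a minimal illustration under several assumptions of our own: a simple quadratic objective, each worker doubling as the server for a slice of the coordinates, independent message drops with probability `drop_prob`, and a worker falling back to its own local value whenever the server's reply is lost. The function name `simulate` and all of its parameters are hypothetical.

```python
import random


def simulate(num_workers=8, dim=16, drop_prob=0.1, steps=300, lr=0.1, seed=0):
    """Toy SGD on f(x) = 0.5 * ||x - target||^2 with lossy worker/server links.

    Returns the Euclidean distance between the averaged worker model
    and the optimum after `steps` rounds.
    """
    rng = random.Random(seed)
    target = [1.0] * dim
    # Every worker keeps its own local copy of the model.
    models = [[0.0] * dim for _ in range(num_workers)]

    for _ in range(steps):
        # 1) Each worker takes a local step with a noisy gradient.
        proposals = []
        for w in range(num_workers):
            grad = [models[w][j] - target[j] + rng.gauss(0.0, 0.1)
                    for j in range(dim)]
            proposals.append([models[w][j] - lr * grad[j] for j in range(dim)])

        # 2) Coordinate j is owned by server (j % num_workers) -- an assumption.
        for j in range(dim):
            owner = j % num_workers
            # Uplink: each worker's message may be dropped; the owner's own
            # value never crosses the network, so it always arrives.
            received = [proposals[w][j] for w in range(num_workers)
                        if w == owner or rng.random() >= drop_prob]
            averaged = sum(received) / len(received)
            # Downlink: a worker that misses the reply keeps its local value.
            for w in range(num_workers):
                if w == owner or rng.random() >= drop_prob:
                    models[w][j] = averaged
                else:
                    models[w][j] = proposals[w][j]

    mean_model = [sum(models[w][j] for w in range(num_workers)) / num_workers
                  for j in range(dim)]
    return sum((mean_model[j] - target[j]) ** 2 for j in range(dim)) ** 0.5


if __name__ == "__main__":
    for p in (0.0, 0.1, 0.3):
        print(f"drop_prob={p}: distance to optimum = {simulate(drop_prob=p):.4f}")
```

Running the script prints the final distance to the optimum for a few drop rates, which gives a rough feel for how much a given loss probability perturbs training in this toy model. It says nothing about the convergence rates the paper actually proves.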
Taxonomy
Topics: Age of Information Optimization · Stochastic Gradient Optimization Techniques · Distributed Sensor Networks and Detection Algorithms
