Computation Scheduling for Distributed Machine Learning with Straggling Workers
Mohammad Mohammadi Amiri, Deniz Gunduz

TL;DR
This paper develops and analyzes scheduling schemes for distributed machine learning that reduce completion time by effectively managing straggling workers and leveraging redundancy, with theoretical and experimental validation.
Contribution
It introduces two novel scheduling schemes for distributed learning that optimize task assignment and order, achieving near-optimal completion times under random delays.
Findings
The proposed schemes significantly reduce the average completion time of a computation round.
Experimental results on Amazon EC2 validate improvements over existing methods.
The gap between the proposed schemes and the theoretical lower bound on the average completion time is small.
Abstract
We study scheduling of computation tasks across n workers in a large-scale distributed learning problem with the help of a master. Computation and communication delays are assumed to be random, and redundant computations are assigned to workers in order to tolerate stragglers. We consider sequential computation of tasks assigned to a worker, while the result of each computation is sent to the master right after its completion. Each computation round, which can model an iteration of the stochastic gradient descent (SGD) algorithm, is completed once the master receives k distinct computations, referred to as the computation target. Our goal is to characterize the average completion time as a function of the computation load, which denotes the portion of the dataset available at each worker, and the computation target. We propose two computation scheduling schemes that specify the tasks…
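The setup described in the abstract can be illustrated with a small Monte Carlo sketch. It is not the paper's implementation: the cyclic assignment in `cyclic_schedule`, the i.i.d. exponential delays, and all function and parameter names are assumptions made here for illustration, and the sketch further assumes that sending a result does not block the worker's next computation. Here `r` stands for the number of tasks assigned to each worker (the computation load) and `k` for the computation target.

```python
import random


def cyclic_schedule(n, r):
    """Hypothetical assignment: worker i computes tasks i, i+1, ..., i+r-1 (mod n), in that order."""
    return [[(i + j) % n for j in range(r)] for i in range(n)]


def completion_time(schedule, k, delay):
    """Time at which the master has received k distinct task results.

    schedule[i] is the ordered task list of worker i.
    delay() returns one random delay sample (assumed delay model).
    """
    arrivals = {}  # task id -> earliest arrival time at the master
    for tasks in schedule:
        t = 0.0
        for task in tasks:
            t += delay()           # tasks are computed one after another
            arrival = t + delay()  # each result is sent right after its completion
            if task not in arrivals or arrival < arrivals[task]:
                arrivals[task] = arrival
    if len(arrivals) < k:
        raise ValueError("schedule covers fewer than k distinct tasks")
    # The round ends when the k-th distinct result arrives.
    return sorted(arrivals.values())[k - 1]


def average_completion_time(n, r, k, trials=2000, seed=0):
    """Monte Carlo estimate of the average completion time of one round."""
    rng = random.Random(seed)

    def delay():
        return rng.expovariate(1.0)  # assumed: exponential delays with mean 1

    schedule = cyclic_schedule(n, r)
    total = sum(completion_time(schedule, k, delay) for _ in range(trials))
    return total / trials
```

Calling `average_completion_time(n, r, k)` for several values of `r` gives a rough picture of how redundancy trades extra computation for tolerance to stragglers under these assumed delays.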
