Low-latency job scheduling with preemption for the development of deep learning

Hidehito Yabuuchi; Daisuke Taniwaki; Shingo Omura

arXiv:1902.01613 · cs.DC · February 6, 2019 · 6 cites

Open Access

TL;DR

This paper introduces a preemptive scheduling algorithm that significantly reduces the latency of trial-and-error deep learning jobs in computing clusters, balancing the needs of small experiments and overall throughput.

Contribution

The paper presents a preemptive scheduling algorithm that handles a mixture of trial-and-error (TE) and best-effort (BE) jobs, reducing TE job latency without greatly increasing the slowdown of BE jobs.

Findings

01. Reduced the 95th percentile of TE job slowdown by 96.6% in simulations.

02. Kept the cost to BE jobs small: their slowdown increased by only 18-24%.

03. Effective on both synthetic and real workloads.
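
The findings above are stated in terms of slowdown and its 95th percentile. As a rough illustration of how such numbers can be computed, the sketch below assumes the common definition of slowdown, (waiting time + running time) / running time, and a nearest-rank percentile. Neither definition is confirmed by this page, and the function names are hypothetical.

```python
import math


def slowdown(wait_time, run_time):
    """Assumed definition: (waiting + running) / running.

    A value of 1.0 means the job started immediately.
    """
    if run_time <= 0:
        raise ValueError("run_time must be positive")
    return (wait_time + run_time) / run_time


def percentile(values, p):
    """Nearest-rank percentile of a non-empty sequence (p in 1..100)."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Nearest-rank method: smallest value with at least p% of data at or below it.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def relative_reduction(baseline, improved):
    """Fractional reduction of a metric relative to a baseline scheduler."""
    return (baseline - improved) / baseline
```

Under these assumptions, a job that waits 30 time units and runs for 10 has a slowdown of 4.0, and a "96.6% reduction" corresponds to `relative_reduction(baseline, improved)` being about 0.966 when both arguments are 95th-percentile slowdowns.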

Abstract

One significant challenge in job scheduling for computing clusters used to develop deep learning algorithms is the efficient scheduling of trial-and-error (TE) jobs, the type of job in which users conduct small-scale experiments while monitoring their processes. Unfortunately, existing job schedulers do not offer well-balanced scheduling for a mixture of TE jobs and best-effort (BE) jobs, or can handle the mixture only in limited situations. To fill this gap, we propose an algorithm that can significantly reduce the latency of TE jobs in versatile situations without greatly increasing the slowdown of the BE jobs. Our algorithm efficiently schedules both TE and BE jobs by selectively preempting the BE jobs that can be resumed without much delay when the time comes. In our simulation study with synthetic and real workloads, we were able…
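
The abstract describes selectively preempting BE jobs that can later be resumed without much delay. The sketch below is one simple way to read that idea, not the authors' algorithm: it greedily picks the running BE jobs with the lowest estimated resume cost until enough GPUs are free for an incoming TE job. The `RunningJob` type, the `resume_cost` field, and the `pick_victims` function are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunningJob:
    name: str
    gpus: int           # GPUs currently held by this BE job
    resume_cost: float  # hypothetical estimate of the delay to resume it later


def pick_victims(running_be, free_gpus, demand):
    """Choose BE jobs to preempt so a TE job needing `demand` GPUs can start.

    Returns [] if no preemption is needed, a list of victims otherwise,
    or None if even preempting every BE job would not free enough GPUs.
    """
    shortfall = demand - free_gpus
    if shortfall <= 0:
        return []
    victims = []
    # Assumption: prefer the jobs that are cheapest to resume.
    for job in sorted(running_be, key=lambda j: j.resume_cost):
        victims.append(job)
        shortfall -= job.gpus
        if shortfall <= 0:
            return victims
    return None
```

For example, with BE jobs holding 2, 4, and 1 GPUs at resume costs 5.0, 1.0, and 3.0, one free GPU, and a TE job that needs 4 GPUs, this sketch preempts only the 4-GPU job. A greedy choice like this can free more GPUs than strictly necessary; a real scheduler would likely weigh that against the resume cost.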

Taxonomy

Topics: Cloud Computing and Resource Management · Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques