Themis: Fair and Efficient GPU Cluster Scheduling

Kshiteej Mahajan; Arjun Balasubramanian; Arjun Singhvi; Shivaram Venkataraman; Aditya Akella; Amar Phanishayee; Shuchi Chawla

arXiv:1907.01484·cs.DC·October 30, 2019·22 cites



TL;DR

Themis is a novel GPU cluster scheduling framework that ensures fair and efficient resource allocation for ML workloads by using a finish-time fairness policy and an auction-based two-level scheduling architecture.

Contribution

Themis introduces a new finish-time fairness policy and an auction-based scheduling architecture tailored for ML workloads' unique attributes.

Findings

01. Improves fairness by over 2.25 times compared to existing schedulers.

02. Achieves 5% to 250% higher cluster efficiency.

03. Effectively handles long-running, placement-sensitive ML tasks.

Abstract

Modern distributed machine learning (ML) training workloads benefit significantly from leveraging GPUs. However, significant contention ensues when multiple such workloads are run atop a shared cluster of GPUs. A key question is how to fairly apportion GPUs across workloads. We find that established cluster scheduling disciplines are a poor fit because of ML workloads' unique attributes: ML jobs have long-running tasks that need to be gang-scheduled, and their performance is sensitive to tasks' relative placement. We propose Themis, a new scheduling framework for ML training workloads. Its GPU allocation policy enforces that ML workloads complete in a finish-time fair manner, a new notion we introduce. To capture placement sensitivity and ensure efficiency, Themis uses a two-level scheduling architecture where ML workloads bid on available resources that are offered in an auction run…
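The abstract names finish-time fairness without defining it. As a rough illustration only, the sketch below assumes the metric is the ratio of an app's estimated finish time in the shared cluster to its estimated finish time with an exclusive 1/N share of the cluster (N being the number of contending apps), and that the scheduler favors the app with the largest ratio. The class, field, and function names are hypothetical and do not come from the Themis codebase, and the finish times are made-up inputs rather than measured results.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class App:
    """Hypothetical record for one ML training app (names are illustrative)."""
    name: str
    t_shared: float       # estimated finish time in the shared cluster
    t_independent: float  # estimated finish time with an exclusive 1/N share


def finish_time_fairness(app: App) -> float:
    """Assumed metric: shared finish time divided by independent finish time.

    A value above 1 means the app finishes later than it would have on its
    own 1/N slice of the cluster; a value at or below 1 means sharing did
    not hurt it.
    """
    if app.t_independent <= 0:
        raise ValueError("t_independent must be positive")
    return app.t_shared / app.t_independent


def worst_off(apps: List[App]) -> App:
    """Return the app with the largest ratio, i.e. the one treated least fairly."""
    return max(apps, key=finish_time_fairness)


# Made-up finish-time estimates, in hours.
apps = [
    App("vision", t_shared=12.0, t_independent=10.0),
    App("speech", t_shared=8.0, t_independent=10.0),
    App("language", t_shared=30.0, t_independent=20.0),
]

for app in apps:
    print(f"{app.name}: ratio = {finish_time_fairness(app):.2f}")
print("worst off:", worst_off(apps).name)
```

Under these assumptions, a scheduler that repeatedly offers freed GPUs to the worst-off apps pushes the maximum ratio down over time, which is one plausible reading of completing "in a finish-time fair manner".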


Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Cloud Computing and Resource Management · IoT and Edge/Fog Computing