Themis: Fair and Efficient GPU Cluster Scheduling
Kshiteej Mahajan, Arjun Balasubramanian, Arjun Singhvi, Shivaram Venkataraman, Aditya Akella, Amar Phanishayee, Shuchi Chawla

TL;DR
Themis is a novel GPU cluster scheduling framework that ensures fair and efficient resource allocation for ML workloads by using a finish-time fairness policy and an auction-based two-level scheduling architecture.
Contribution
Themis introduces a new finish-time fairness policy and an auction-based scheduling architecture tailored for ML workloads' unique attributes.
Findings
Improves fairness by over 2.25 times compared to existing schedulers.
Achieves 5% to 250% higher cluster efficiency than the same baseline schedulers.
Effectively handles long-running, placement-sensitive ML tasks.
Abstract
Modern distributed machine learning (ML) training workloads benefit significantly from leveraging GPUs. However, significant contention ensues when multiple such workloads are run atop a shared cluster of GPUs. A key question is how to fairly apportion GPUs across workloads. We find that established cluster scheduling disciplines are a poor fit because of ML workloads' unique attributes: ML jobs have long-running tasks that need to be gang-scheduled, and their performance is sensitive to tasks' relative placement. We propose Themis, a new scheduling framework for ML training workloads. Its GPU allocation policy enforces that ML workloads complete in a finish-time fair manner, a new notion we introduce. To capture placement sensitivity and ensure efficiency, Themis uses a two-level scheduling architecture where ML workloads bid on available resources that are offered in an auction run…
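Example (illustrative, not from the abstract)
The finish-time fairness notion can be made concrete with a small sketch. In the full paper the metric is the ratio of an app's finish time in the shared cluster to its finish time on an exclusive 1/N share of the cluster, where N is the number of contending apps. A ratio above 1 means the app is worse off for sharing. The class, function names, and numbers below are hypothetical and chosen only for illustration. The ranking step is a simplification of how a scheduler might favor the worst-off apps, not the Themis auction itself.

```python
from dataclasses import dataclass


@dataclass
class App:
    """Hypothetical record for one ML training app (names are assumptions)."""
    name: str
    t_shared: float       # estimated finish time in the shared cluster (hours)
    t_independent: float  # estimated finish time on an exclusive 1/N share (hours)


def finish_time_fairness(app: App) -> float:
    """Return rho = t_shared / t_independent; values above 1 mean sharing hurts the app."""
    if app.t_independent <= 0:
        raise ValueError("t_independent must be positive")
    return app.t_shared / app.t_independent


def rank_worst_off_first(apps: list[App]) -> list[App]:
    """Order apps so that those farthest from fair (largest rho) come first."""
    return sorted(apps, key=finish_time_fairness, reverse=True)


# Made-up example values, for illustration only.
apps = [
    App("vision", t_shared=30.0, t_independent=20.0),
    App("nlp", t_shared=12.0, t_independent=16.0),
    App("speech", t_shared=50.0, t_independent=20.0),
]
ranked_names = [a.name for a in rank_worst_off_first(apps)]
```

With these made-up values, "speech" has the largest ratio and would be first in line for newly freed GPUs, while "nlp" is already finishing sooner than it would on its own share.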
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Cloud Computing and Resource Management · IoT and Edge/Fog Computing
