Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning

Aurick Qiao; Sang Keun Choe; Suhas Jayaram Subramanya; Willie Neiswanger; Qirong Ho; Hao Zhang; Gregory R. Ganger; Eric P. Xing

arXiv:2008.12260·cs.DC·May 27, 2021·24 cites

TL;DR

Pollux is a co-adaptive scheduler for deep learning clusters that dynamically optimizes resource allocation and training efficiency, significantly reducing job completion times and improving fairness by modeling and maximizing goodput.

Contribution

Pollux introduces a novel co-optimization approach that adaptively reallocates resources based on real-time goodput modeling during training, enhancing efficiency and fairness in DL scheduling.

Findings

01 Reduces average job completion times by 37-50% relative to state-of-the-art DL schedulers.

02 Improves resource utilization and fairness among DL jobs.

03 Reveals opportunities for cost reduction in cloud DL training.

Abstract

Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-optimizing inter-dependent factors both at the per-job level and at the cluster-wide level. Most existing schedulers expect users to specify the number of resources for each job, often leading to inefficient resource use. Some recent schedulers choose job resources for users, but do so without awareness of how DL training can be re-optimized to better utilize the provided resources. Pollux simultaneously considers both aspects. By monitoring the status of each job during training, Pollux models how its goodput (a novel metric we introduce that combines system throughput with statistical efficiency) would change when resources are added or removed. Leveraging this information, Pollux dynamically (re-)assigns resources to improve cluster-wide goodput, while respecting fairness and continually…
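To make the goodput idea concrete, here is a minimal Python sketch that treats goodput as the product of system throughput and statistical efficiency, as the abstract describes. The class `JobModel`, its fields, the toy throughput model, and the efficiency formula are illustrative assumptions made for this example. They are not taken from the abstract and should not be read as Pollux's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class JobModel:
    """Hypothetical per-job parameters (all fields are assumptions)."""
    base_batch_size: int             # batch size the user originally chose
    noise_scale: float               # assumed statistic: larger means big batches cost less efficiency
    compute_time_per_example: float  # seconds per example on one GPU
    sync_time: float                 # fixed per-step sync overhead (seconds) when using >1 GPU


def throughput(job: JobModel, num_gpus: int, batch_size: int) -> float:
    """Examples processed per second under a toy compute-plus-sync model."""
    compute = job.compute_time_per_example * batch_size / num_gpus
    sync = job.sync_time if num_gpus > 1 else 0.0
    return batch_size / (compute + sync)


def statistical_efficiency(job: JobModel, batch_size: int) -> float:
    """Training progress per example, relative to the base batch size.

    Assumed form: equals 1.0 at the base batch size and decays as the
    batch grows beyond it.
    """
    return (job.noise_scale + job.base_batch_size) / (job.noise_scale + batch_size)


def goodput(job: JobModel, num_gpus: int, batch_size: int) -> float:
    """Goodput = system throughput x statistical efficiency."""
    return throughput(job, num_gpus, batch_size) * statistical_efficiency(job, batch_size)


def best_batch_size(job: JobModel, num_gpus: int, candidates: list[int]) -> int:
    """Pick the candidate batch size with the highest goodput for a given allocation."""
    return max(candidates, key=lambda m: goodput(job, num_gpus, m))


if __name__ == "__main__":
    job = JobModel(base_batch_size=128, noise_scale=1000.0,
                   compute_time_per_example=0.001, sync_time=0.05)
    for gpus in (1, 4):
        m = best_batch_size(job, gpus, [128, 512, 2048])
        print(gpus, m, round(goodput(job, gpus, m), 1))
```

In this toy setting, the largest batch size gives the highest raw throughput on 4 GPUs but not the highest goodput, because the loss in statistical efficiency outweighs the throughput gain. That trade-off is why a scheduler that looks only at throughput can pick poor configurations.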

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Privacy-Preserving Technologies in Data