Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning
Aurick Qiao, Sang Keun Choe, Suhas Jayaram Subramanya, Willie Neiswanger, Qirong Ho, Hao Zhang, Gregory R. Ganger, Eric P. Xing

TL;DR
Pollux is a co-adaptive scheduler for deep learning clusters that dynamically optimizes resource allocation and training efficiency, significantly reducing job completion times and improving fairness by modeling and maximizing goodput.
Contribution
Pollux introduces a novel co-optimization approach that adaptively reallocates resources based on real-time goodput modeling during training, enhancing efficiency and fairness in DL scheduling.
Findings
Reduces average job completion times by 37-50% relative to state-of-the-art DL schedulers.
Improves resource utilization and fairness among DL jobs.
Reveals opportunities for cost reduction in cloud DL training.
Abstract
Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-optimizing inter-dependent factors both at the per-job level and at the cluster-wide level. Most existing schedulers expect users to specify the number of resources for each job, often leading to inefficient resource use. Some recent schedulers choose job resources for users, but do so without awareness of how DL training can be re-optimized to better utilize the provided resources. Pollux simultaneously considers both aspects. By monitoring the status of each job during training, Pollux models how its goodput (a novel metric we introduce that combines system throughput with statistical efficiency) would change by adding or removing resources. Leveraging this information, Pollux dynamically (re-)assigns resources to improve cluster-wide goodput, while respecting fairness and continually…
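To make the goodput idea in the abstract concrete, here is a minimal Python sketch, not the authors' implementation. It assumes goodput is the product of a throughput term and a statistical-efficiency term. The `JobModel` fields, the diminishing-returns throughput curve, the efficiency formula, and the greedy `allocate` routine are all hypothetical simplifications chosen for illustration. In particular, the greedy loop is a stand-in for the scheduler's real cluster-wide optimization and ignores fairness.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class JobModel:
    """Toy per-job model. Every field is a hypothetical illustration."""
    per_gpu_rate: float   # examples/sec one GPU processes on its own
    sync_overhead: float  # fractional slowdown added per extra GPU
    base_batch: int       # batch size the job was originally tuned for
    noise_scale: float    # larger value: big batches stay efficient longer
    per_gpu_batch: int    # examples each GPU handles per training step


def throughput(job: JobModel, gpus: int) -> float:
    """Examples/sec, with diminishing returns from synchronization."""
    if gpus <= 0:
        return 0.0
    return job.per_gpu_rate * gpus / (1.0 + job.sync_overhead * (gpus - 1))


def efficiency(job: JobModel, batch: int) -> float:
    """Training progress per example relative to base_batch, in (0, 1]."""
    return (job.noise_scale + job.base_batch) / (job.noise_scale + batch)


def goodput(job: JobModel, gpus: int) -> float:
    """System throughput scaled by statistical efficiency."""
    batch = max(job.base_batch, job.per_gpu_batch * gpus)
    return throughput(job, gpus) * efficiency(job, batch)


def allocate(jobs: Dict[str, JobModel], total_gpus: int) -> Dict[str, int]:
    """Greedy stand-in: hand each GPU to the job with the largest
    marginal goodput gain. Ignores fairness and reallocation cost."""
    alloc = {name: 0 for name in jobs}
    for _ in range(total_gpus):
        best = max(
            jobs,
            key=lambda n: goodput(jobs[n], alloc[n] + 1)
            - goodput(jobs[n], alloc[n]),
        )
        alloc[best] += 1
    return alloc
```

In this toy model, a job whose `noise_scale` is large keeps benefiting from extra GPUs because larger batches cost it little efficiency, so the greedy loop tends to give it more of the cluster than a job whose efficiency drops quickly as its batch grows.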
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Privacy-Preserving Technologies in Data
