Mirage: Towards Low-interruption Services on Batch GPU Clusters with Reinforcement Learning
Qiyang Ding, Pengfei Zheng, Shreyas Kudari, Shivaram Venkataraman, Zhao Zhang

TL;DR
Mirage is a reinforcement learning-based resource provisioner designed to reduce job interruptions on GPU clusters, significantly improving the quality of service for deep learning workloads by proactively managing resources.
Contribution
This paper introduces Mirage, a novel RL-based system that proactively reduces job interruptions on GPU clusters, enhancing productivity and QoS for deep learning tasks.
Findings
Reduces job interruptions by up to 100%
Safeguards up to 76% of jobs with zero interruption
Effective across different load levels
Abstract
Accommodating long-running deep learning (DL) training and inference jobs is challenging on GPU clusters that use traditional batch schedulers, such as Slurm. Given fixed wall clock time limits, DL researchers usually need to run a sequence of batch jobs and experience long interruptions on overloaded machines. Such interruptions significantly lower the research productivity and QoS for services that are deployed in production. To mitigate the issues from interruption, we investigate a set of statistical learning and reinforcement learning (RL) techniques, including random forest, xgboost, Deep Q-Network, and policy gradient to design a proactive provisioner using production job traces from three GPU clusters. We follow the standard machine learning practice by partitioning each job trace into training and validation subsets, then train each model using the training subset and evaluate…
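The trace partitioning mentioned in the abstract, and the notion of an interruption between consecutive batch jobs, can be illustrated with a minimal sketch. This is not the authors' code: the `Job` fields, the chronological split, the 0.8 training fraction, and the definition of interruption as the idle gap between one job's end and its successor's start are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical trace record; times are in hours since the trace began.
    submit: float
    start: float
    end: float


def split_trace(jobs, train_fraction=0.8):
    """Split a trace chronologically (by submit time) into training and
    validation subsets. The 0.8 default is an assumed value."""
    ordered = sorted(jobs, key=lambda j: j.submit)
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def interruption(prev, succ):
    """Idle gap between a job's end and its successor's start.
    Returns 0.0 when the successor starts before the predecessor ends."""
    return max(0.0, succ.start - prev.end)


# Toy trace: five jobs submitted one hour apart, each running 48 hours.
trace = [Job(submit=float(i), start=float(i) + 1.0, end=float(i) + 49.0)
         for i in range(5)]
train, valid = split_trace(trace)

# A successor submitted early (hour 40) that still starts 3 hours after
# its predecessor hits the wall clock limit at hour 49.
gap = interruption(Job(0.0, 1.0, 49.0), Job(40.0, 52.0, 100.0))
```

A proactive provisioner in this framing would choose the successor's submit time so that `gap` is as close to zero as possible without excessive overlap between the two jobs.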
Taxonomy
Topics: IoT and Edge/Fog Computing · Cloud Computing and Resource Management · Age of Information Optimization
