Mirage: Towards Low-interruption Services on Batch GPU Clusters with Reinforcement Learning

Qiyang Ding; Pengfei Zheng; Shreyas Kudari; Shivaram Venkataraman; Zhao Zhang

arXiv:2306.14086 · cs.DC · June 27, 2023

TL;DR

Mirage is a reinforcement learning-based resource provisioner designed to reduce job interruptions on GPU clusters, significantly improving the quality of service for deep learning workloads by proactively managing resources.

Contribution

This paper introduces Mirage, a novel RL-based system that proactively reduces job interruptions on GPU clusters, enhancing productivity and QoS for deep learning tasks.

Findings

1. Reduces job interruptions by up to 100%
2. Safeguards up to 76% of jobs with zero interruption
3. Effective across different load levels

Abstract

Accommodating long-running deep learning (DL) training and inference jobs is challenging on GPU clusters that use traditional batch schedulers, such as Slurm. Given fixed wall clock time limits, DL researchers usually need to run a sequence of batch jobs and experience long interruptions on overloaded machines. Such interruptions significantly lower the research productivity and QoS for services that are deployed in production. To mitigate the issues from interruption, we investigate a set of statistical learning and reinforcement learning (RL) techniques, including random forest, xgboost, Deep Q-Network, and policy gradient to design a proactive provisioner using production job traces from three GPU clusters. We follow the standard machine learning practice by partitioning each job trace into training and validation subsets, then train each model using the training subset and evaluate…
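To make the notion of "interruption" concrete, here is a minimal sketch of the gap between two chained batch jobs, and of how submitting the successor before the current job reaches its wall clock limit can shrink that gap. This is not the paper's implementation: the `ChainedJob` type, the `interruption` and `overlap` helpers, and all numbers are hypothetical, and the queue wait is assumed to be known and independent of the submission time, which a real cluster does not guarantee.

```python
from dataclasses import dataclass


@dataclass
class ChainedJob:
    """Hypothetical record for a successor job in a chain of batch jobs (hours)."""
    predecessor_end: float  # when the running job hits its wall clock limit
    submit_time: float      # when the successor job is submitted
    queue_wait: float       # assumed time the successor waits in the queue


def interruption(job: ChainedJob) -> float:
    """Idle gap between the predecessor ending and the successor starting."""
    successor_start = job.submit_time + job.queue_wait
    return max(0.0, successor_start - job.predecessor_end)


def overlap(job: ChainedJob) -> float:
    """Time both jobs would hold resources if the successor starts early."""
    successor_start = job.submit_time + job.queue_wait
    return max(0.0, job.predecessor_end - successor_start)


# Reactive: submit only when the current job ends, so the full queue wait is idle time.
reactive = ChainedJob(predecessor_end=48.0, submit_time=48.0, queue_wait=6.0)

# Proactive: submit 5 hours early, so most of the queue wait overlaps with useful work.
proactive = ChainedJob(predecessor_end=48.0, submit_time=43.0, queue_wait=6.0)

# Too early: the successor starts before the predecessor finishes and wastes resources.
too_early = ChainedJob(predecessor_end=48.0, submit_time=40.0, queue_wait=6.0)

for name, job in [("reactive", reactive), ("proactive", proactive), ("too_early", too_early)]:
    print(f"{name}: interruption={interruption(job)}h, overlap={overlap(job)}h")
```

The sketch shows the trade-off a proactive provisioner has to balance: submitting too late leaves an idle gap, while submitting too early makes the two jobs overlap. Because the queue wait is not known in advance on a real cluster, the submission time has to be chosen from a prediction or a learned policy.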

Code & Models

Repositories

zhaozhang/mirage (PyTorch, official)

Taxonomy

Topics: IoT and Edge/Fog Computing · Cloud Computing and Resource Management · Age of Information Optimization