Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances
Jiangfei Duan, Ziang Song, Xupeng Miao, Xiaoli Xi, Dahua Lin, Harry Xu, Minjia Zhang, and Zhihao Jia

TL;DR
Parcae is a proactive system that optimizes deep neural network training on preemptible cloud instances by predicting preemptions and adjusting parallelization strategies to reduce costs and improve robustness.
Contribution
It introduces a proactive approach that combines liveput optimization, preemption prediction, and lightweight migration to improve DNN training on preemptible instances.
Findings
Parcae outperforms existing systems by up to 10x in cost and speed.
It achieves near-optimal training performance under frequent preemptions.
Proactive strategies significantly improve robustness and efficiency.
Abstract
Deep neural networks (DNNs) are becoming progressively large and costly to train. This paper aims to reduce DNN training costs by leveraging preemptible instances on modern clouds, which can be allocated at a much lower price when idle but may be preempted by the cloud provider at any time. Prior work that supports DNN training on preemptible instances employs a reactive approach to handling instance preemptions and allocations after their occurrence, which only achieves limited performance and scalability. We present Parcae, a system that enables cheap, fast, and scalable DNN training on preemptible instances by proactively adjusting the parallelization strategy of a DNN training job to adapt to predicted resource changes before instance preemptions and allocations really happen, which significantly reduces the cost of handling these events. Parcae optimizes liveput, a novel metric…
Taxonomy
Topics: Speech Recognition and Synthesis · Robotics and Automated Systems · Topic Modeling
