Machine Learning on Volatile Instances
Xiaoxi Zhang, Jianyu Wang, Gauri Joshi, and Carlee Joe-Wong

TL;DR
This paper investigates how preemptible cloud instances with volatile availability impact distributed stochastic gradient descent, proposing strategies to optimize training cost and time despite interruptions.
Contribution
To the best of the authors' knowledge, it presents the first analysis of how instance preemption affects SGD convergence and training time, and it proposes practical strategies for cost-effective distributed training.
Findings
Preemption probability significantly impacts training time and accuracy.
The proposed strategies can achieve training performance close to that of standard instances at lower cost.
Experimental results validate cost savings with maintained training quality.
Abstract
Due to the massive size of the neural network models and training datasets used in machine learning today, it is imperative to distribute stochastic gradient descent (SGD) by splitting up tasks such as gradient evaluation across multiple worker nodes. However, running distributed SGD can be prohibitively expensive because it may require specialized computing resources such as GPUs for extended periods of time. We propose cost-effective strategies to exploit volatile cloud instances that are cheaper than standard instances, but may be interrupted by higher priority workloads. To the best of our knowledge, this work is the first to quantify how variations in the number of active worker nodes (as a result of preemption) affect SGD convergence and the time to train the model. By understanding these trade-offs between preemption probability of the instances, accuracy, and training time, we…
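To make the setting in the abstract concrete, the following is a minimal toy sketch of synchronous SGD in which the number of active workers varies from step to step. It is not the authors' method or analysis. The independent per-step (Bernoulli) preemption model, the one-dimensional quadratic objective, and the name `simulate_volatile_sgd` with all of its parameters are assumptions made purely for illustration.

```python
import random


def simulate_volatile_sgd(num_workers=4, preempt_prob=0.3, steps=200,
                          lr=0.1, batch_size=8, seed=0):
    """Toy synchronous SGD on f(w) = 0.5 * E[(w - x)^2] with x ~ N(3, 1).

    Assumed preemption model (hypothetical, for illustration only): at every
    step each worker is independently unavailable with probability
    preempt_prob. Active workers each compute one minibatch gradient and the
    update uses the average over active workers. A step with no active
    worker makes no progress and is counted as idle.
    """
    rng = random.Random(seed)
    w = 0.0
    idle_steps = 0
    for _ in range(steps):
        # Count how many workers survived preemption in this step.
        num_active = sum(rng.random() >= preempt_prob
                         for _ in range(num_workers))
        if num_active == 0:
            idle_steps += 1
            continue
        grads = []
        for _ in range(num_active):
            batch = [rng.gauss(3.0, 1.0) for _ in range(batch_size)]
            # Gradient of 0.5 * (w - x)^2 with respect to w is (w - x).
            grads.append(sum(w - x for x in batch) / batch_size)
        w -= lr * sum(grads) / num_active
    return w, idle_steps


if __name__ == "__main__":
    for q in (0.0, 0.3, 0.7):
        w, idle = simulate_volatile_sgd(preempt_prob=q)
        print(f"preempt_prob={q}: w={w:.3f}, idle_steps={idle}")
```

In this sketch a higher `preempt_prob` leaves fewer gradients to average per step (noisier updates) and produces more idle steps (longer wall-clock time for the same number of useful updates), which is the kind of trade-off between preemption probability, accuracy, and training time that the abstract refers to.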
Taxonomy
Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Explainable Artificial Intelligence (XAI)
Methods: Stochastic Gradient Descent
