Srifty: Swift and Thrifty Distributed Training on the Cloud
Liang Luo, Peter West, Arvind Krishnamurthy, Luis Ceze

TL;DR
Srifty is a system that optimizes cloud VM selection for distributed neural network training by predicting performance and cost, improving efficiency and reducing expenses in real-world cloud environments.
Contribution
This work introduces Srifty, a system that combines runtime profiling with learned performance models to optimize VM selection for distributed training on the cloud, accounting for heterogeneity and spot instances.
Findings
Achieves 8% iteration latency prediction error.
Provides significant throughput gains.
Reduces training costs effectively.
Abstract
Finding the best VM configuration is key to achieving lower cost and higher throughput, two primary concerns in cloud-based distributed neural network (NN) training today. Optimal VM selection that meets user constraints requires efficiently navigating a large search space while controlling for the performance variance associated with sharing cloud instances and networks. In this work, we characterize this variance in the context of distributed NN training and present results of a comprehensive throughput and cost-efficiency study we conducted across a wide array of instances to prune the search space for the optimal VM. Using insights from these studies, we built Srifty, a system that combines runtime profiling with learned performance models to accurately predict training performance and find the best VM choice that satisfies user constraints, potentially leveraging both heterogeneous…
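To make the selection problem in the abstract concrete, here is a minimal sketch of constraint-based VM choice: given predicted throughput for each candidate configuration, pick the cheapest one that finishes within a deadline. This is not Srifty's algorithm or API. The `Candidate` class, the `cheapest_within_deadline` function, the configuration names, and every price and throughput figure are hypothetical values chosen for illustration; in a real system the throughput numbers would come from a performance model.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    """A hypothetical VM configuration with a predicted training throughput."""
    name: str                     # illustrative label, not a real instance type
    num_vms: int
    price_per_vm_hour: float      # USD per VM-hour (made-up value)
    predicted_throughput: float   # samples/second for the whole cluster (made-up)

    def hours_for(self, total_samples: float) -> float:
        # Wall-clock hours needed to process the requested number of samples.
        return total_samples / self.predicted_throughput / 3600.0

    def cost_for(self, total_samples: float) -> float:
        # Total bill: hours * number of VMs * hourly price per VM.
        return self.hours_for(total_samples) * self.num_vms * self.price_per_vm_hour


def cheapest_within_deadline(
    candidates: Iterable[Candidate],
    total_samples: float,
    max_hours: float,
) -> Optional[Candidate]:
    """Return the lowest-cost candidate that meets the deadline, or None."""
    feasible = [c for c in candidates if c.hours_for(total_samples) <= max_hours]
    return min(feasible, key=lambda c: c.cost_for(total_samples), default=None)


# Hypothetical candidates; all numbers are assumptions for this example.
CANDIDATES = [
    Candidate("on-demand-x4", num_vms=4, price_per_vm_hour=3.00, predicted_throughput=2000.0),
    Candidate("on-demand-x8", num_vms=8, price_per_vm_hour=3.00, predicted_throughput=3600.0),
    Candidate("spot-x8", num_vms=8, price_per_vm_hour=0.90, predicted_throughput=3000.0),
]

if __name__ == "__main__":
    total = 72_000_000  # samples to process over the whole training run
    for deadline in (8.0, 6.0, 5.0):
        best = cheapest_within_deadline(CANDIDATES, total, deadline)
        print(deadline, best.name if best else "no feasible configuration")
```

Exhaustive filtering is enough for three candidates. The abstract's point is that the real search space is large and measured performance varies, which is why candidates need pruning and throughput needs a predictive model rather than fixed numbers like these.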
Taxonomy
Topics: Advanced Neural Network Applications · Machine Learning and ELM · Stochastic Gradient Optimization Techniques
