TL;DR
This paper systematically compares serverless (FaaS) and IaaS infrastructures for distributed machine learning training. It finds that serverless can be faster, particularly for models with low communication overhead, but is not necessarily cheaper.
Contribution
It introduces LambdaML, a platform for fair comparison of FaaS and IaaS in ML training, and provides an analytic model for cost-performance tradeoffs.
Findings
FaaS is faster for models with efficient communication and quick convergence.
Serverless training is not significantly cheaper than IaaS.
Performance benefits depend on model communication characteristics.
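The way "faster" and "not cheaper" can both hold is easiest to see in a toy calculation. The sketch below is not the paper's analytic model. It assumes runtime is simply startup plus compute plus communication time, and that every worker is billed for the full runtime at a flat per-second rate. The `Run` class, the `compare` function, and every number are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One training run. All fields are hypothetical inputs, not measurements."""
    workers: int                # number of parallel workers
    startup_s: float            # seconds before training starts (cold start or VM boot)
    compute_s: float            # seconds each worker spends computing
    comm_s: float               # seconds each worker spends exchanging updates
    price_per_worker_s: float   # assumed flat billing rate per worker-second

    def runtime_s(self) -> float:
        # Assumption: the three phases do not overlap.
        return self.startup_s + self.compute_s + self.comm_s

    def cost(self) -> float:
        # Assumption: every worker is billed for the whole runtime.
        return self.workers * self.runtime_s() * self.price_per_worker_s


def compare(faas: Run, iaas: Run) -> dict:
    """Return the FaaS speedup over IaaS and the FaaS/IaaS cost ratio."""
    return {
        "speedup": iaas.runtime_s() / faas.runtime_s(),
        "cost_ratio": faas.cost() / iaas.cost(),
    }


# Made-up scenario: FaaS starts quickly but pays more per second
# and spends longer communicating than IaaS.
faas = Run(workers=10, startup_s=5, compute_s=100, comm_s=45, price_per_worker_s=0.0004)
iaas = Run(workers=10, startup_s=180, compute_s=100, comm_s=20, price_per_worker_s=0.0002)
result = compare(faas, iaas)
print(result)
```

With these made-up inputs the FaaS run finishes in half the time but costs about the same, because its higher per-second price offsets the shorter runtime. Raising `comm_s` for the FaaS run erodes the speedup, which is the sense in which the benefit depends on a model's communication characteristics.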
Abstract
The appeal of serverless (FaaS) has triggered a growing interest in how to use it in data-intensive applications such as ETL, query processing, or machine learning (ML). Several systems exist for training large-scale ML models on top of serverless infrastructures (e.g., AWS Lambda) but with inconclusive results in terms of their performance and relative advantage over "serverful" infrastructures (IaaS). In this paper we present a systematic, comparative study of distributed ML training over FaaS and IaaS. We present a design space covering design choices such as optimization algorithms and synchronization protocols, and implement a platform, LambdaML, that enables a fair comparison between FaaS and IaaS. We present experimental results using LambdaML, and further develop an analytic model to capture cost/performance tradeoffs that must be considered when opting for a serverless…
