Towards Demystifying Serverless Machine Learning Training

Jiawei Jiang; Shaoduo Gan; Yue Liu; Fanlin Wang; Gustavo Alonso; Ana Klimovic; Ankit Singla; Wentao Wu; Ce Zhang

arXiv:2105.07806·cs.DC·May 18, 2021

TL;DR

This paper systematically compares serverless (FaaS) and IaaS infrastructures for distributed machine learning training. It finds that serverless training can be faster, particularly for models with low communication overhead, but is not necessarily cheaper.

Contribution

The paper introduces LambdaML, a platform that enables a fair comparison of FaaS and IaaS for ML training, and provides an analytic model of the cost/performance tradeoffs.

Findings

01. FaaS is faster for models with efficient communication and quick convergence.

02. Serverless training is not significantly cheaper than IaaS.

03. Performance benefits depend on model communication characteristics.
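The "faster but not cheaper" pattern in these findings can be illustrated with a toy calculation. The sketch below is not the paper's analytic model; it assumes cost is simply workers × runtime × unit price, and every name and number in it (`Run`, `compare`, the prices and runtimes) is hypothetical and chosen only to show how a speedup can be offset by a higher per-second price.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One training run. All field values used below are hypothetical."""
    workers: int
    seconds: float                  # wall-clock time to convergence
    price_per_worker_second: float  # assumed unit price in dollars

    def cost(self) -> float:
        # Simplifying assumption: bill = workers x runtime x unit price.
        return self.workers * self.seconds * self.price_per_worker_second


def compare(faas: Run, iaas: Run) -> dict:
    """Return the FaaS speedup over IaaS and the FaaS/IaaS cost ratio."""
    return {
        "speedup": iaas.seconds / faas.seconds,
        "cost_ratio": faas.cost() / iaas.cost(),
    }


# Hypothetical scenario: FaaS converges in half the time,
# but each worker-second costs twice as much.
faas = Run(workers=10, seconds=100.0, price_per_worker_second=0.0002)
iaas = Run(workers=10, seconds=200.0, price_per_worker_second=0.0001)
result = compare(faas, iaas)
print(result)
```

Under these made-up inputs the FaaS run finishes in half the time while the total bill is about the same, which is the shape of tradeoff the findings describe. Real prices, startup latencies, and communication costs would change the numbers.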

Abstract

The appeal of serverless (FaaS) has triggered a growing interest in how to use it in data-intensive applications such as ETL, query processing, or machine learning (ML). Several systems exist for training large-scale ML models on top of serverless infrastructures (e.g., AWS Lambda) but with inconclusive results in terms of their performance and relative advantage over "serverful" infrastructures (IaaS). In this paper we present a systematic, comparative study of distributed ML training over FaaS and IaaS. We present a design space covering design choices such as optimization algorithms and synchronization protocols, and implement a platform, LambdaML, that enables a fair comparison between FaaS and IaaS. We present experimental results using LambdaML, and further develop an analytic model to capture cost/performance tradeoffs that must be considered when opting for a serverless…

Code & Models

Repositories

DS3Lab/LambdaML (PyTorch, official)
