SERFLOW: A Cross-Service Cost Optimization Framework for SLO-Aware Dynamic ML Inference
Zongshun Zhang, Ibrahim Matta

TL;DR
SERFLOW is a cost-aware, adaptive framework for ML inference that dynamically offloads model stages across FaaS and IaaS, reducing cloud costs under real-world conditions such as VM cold starts and long-tail service times.
Contribution
The paper models each ML inference request as traversing multiple stages and proposes a resource provisioning method that combines serverless (FaaS) and VM (IaaS) resources for cost efficiency.
Findings
Reduced cloud costs by over 23%
Effectively handles long-tail and variable request distributions
Balances load across FaaS and IaaS resources
Abstract
Dynamic offloading of Machine Learning (ML) model partitions across different resource orchestration services, such as Function-as-a-Service (FaaS) and Infrastructure-as-a-Service (IaaS), can balance processing and transmission delays while minimizing costs of adaptive inference applications. However, prior work often overlooks real-world factors, such as Virtual Machine (VM) cold starts, requests under long-tail service time distributions, etc. To tackle these limitations, we model each ML query (request) as traversing an acyclic sequence of stages, wherein each stage constitutes a contiguous block of sparse model parameters ending in an internal or final classifier where requests may exit. Since input-dependent exit rates vary, no single resource configuration suits all query distributions. IaaS-based VMs become underutilized when many requests exit early, yet rapidly scaling to…
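The stage-and-exit model described in the abstract can be illustrated with a small sketch. It assumes each stage has a single, fixed exit probability (the paper notes that real exit rates are input-dependent), and the function name and example values are hypothetical rather than taken from SERFLOW. It computes the fraction of requests that reach each stage, which is why later stages see less load when many requests exit early.

```python
from typing import List


def stage_arrival_fractions(exit_rates: List[float]) -> List[float]:
    """Return the fraction of all requests that reach each stage.

    exit_rates[i] is the assumed probability that a request reaching
    stage i exits at that stage's classifier (hypothetical input).
    """
    fractions = []
    remaining = 1.0  # every request enters the first stage
    for rate in exit_rates:
        if not 0.0 <= rate <= 1.0:
            raise ValueError("exit rates must lie in [0, 1]")
        fractions.append(remaining)
        remaining *= 1.0 - rate  # only non-exiting requests move on
    return fractions


# Illustrative values only: half of the requests exit at each internal
# classifier, and every remaining request exits at the final one.
example = stage_arrival_fractions([0.5, 0.5, 1.0])
```

With these illustrative rates, the third stage receives a quarter of the traffic, so resources sized for the full request rate at every stage would sit partly idle at the later stages.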
Taxonomy
Topics: Cloud Computing and Resource Management · IoT and Edge/Fog Computing · Big Data and Digital Economy
