CascadeServe: Unlocking Model Cascades for Inference Serving

Ferdi Kossmann; Ziniu Wu; Alex Turk; Nesime Tatbul; Lei Cao; Samuel Madden

arXiv:2406.14424 · cs.DC · June 21, 2024 · 1 citation

TL;DR

CascadeServe introduces an automated system that leverages model cascades for efficient, cost-effective inference serving, adapting dynamically to workload variations while maintaining accuracy.

Contribution

The paper presents CascadeServe, which the authors describe as the first system to use model cascades inside an online inference serving system, combining automated offline planning with real-time adaptation to the workload.

Findings

01. Achieves 2-3x cost savings over baselines

02. Effectively adapts to workload variations with minimal overhead

03. Maintains high accuracy while reducing computational costs

Abstract

Machine learning (ML) models are increasingly deployed to production, calling for efficient inference serving systems. Efficient inference serving is complicated by two challenges: (i) ML models incur high computational costs, and (ii) the request arrival rates of practical applications have frequent, high, and sudden variations which make it hard to correctly provision hardware. Model cascades are positioned to tackle both of these challenges, as they (i) save work while maintaining accuracy, and (ii) expose a high-resolution trade-off between work and accuracy, allowing for fine-grained adjustments to request arrival rates. Despite their potential, model cascades haven't been used inside an online serving system. This comes with its own set of challenges, including workload adaption, model replication onto hardware, inference scheduling, request batching, and more. In this work, we…
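
To make the idea of a model cascade concrete, the following is a minimal sketch of a two-stage cascade: a cheap model answers when it is confident enough, and otherwise the request is passed to a more expensive model. It is not CascadeServe's implementation. The toy models, the names `cheap_model`, `expensive_model`, `run_cascade`, and `fraction_escalated`, and the confidence-threshold rule are all assumptions chosen for illustration.

```python
from typing import Callable, Sequence, Tuple

# A prediction is (label, confidence in [0, 1]).
Prediction = Tuple[int, float]
Model = Callable[[float], Prediction]


def cheap_model(x: float) -> Prediction:
    """Hypothetical fast model: less confident the closer x is to 0.5."""
    label = int(x >= 0.5)
    confidence = min(1.0, 0.5 + abs(x - 0.5))
    return label, confidence


def expensive_model(x: float) -> Prediction:
    """Hypothetical slow model: assumed to always be fully confident."""
    return int(x >= 0.5), 1.0


def run_cascade(
    x: float, models: Sequence[Model], thresholds: Sequence[float]
) -> Tuple[int, int]:
    """Run models in order; stop at the first confident one.

    thresholds[i] is the minimum confidence needed to accept model i's
    answer. The last model has no threshold and always answers.
    Returns (label, number of models invoked).
    """
    for i, model in enumerate(models):
        label, confidence = model(x)
        is_last = i == len(models) - 1
        if is_last or confidence >= thresholds[i]:
            return label, i + 1
    raise ValueError("cascade needs at least one model")


def fraction_escalated(inputs: Sequence[float], threshold: float) -> float:
    """Share of inputs that needed the expensive model at this threshold."""
    models = [cheap_model, expensive_model]
    escalated = sum(run_cascade(x, models, [threshold])[1] > 1 for x in inputs)
    return escalated / len(inputs)


if __name__ == "__main__":
    inputs = [0.05, 0.3, 0.45, 0.55, 0.7, 0.95]
    for threshold in (0.6, 0.9):
        print(threshold, fraction_escalated(inputs, threshold))
```

In this toy setup, raising the threshold sends more requests to the expensive model, and lowering it sends fewer. That single knob is one simple way to picture the work-versus-accuracy trade-off the abstract refers to.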

Taxonomy

Topics: Machine Learning in Healthcare

Methods: Sparse Evolutionary Training