HybridServe: Efficient Serving of Large AI Models with Confidence-Based Cascade Routing

Leyang Xue; Yao Fu; Luo Mai; Mahesh K. Marina

arXiv:2505.12566·cs.LG·May 20, 2025

TL;DR

HybridServe is a system that efficiently serves large AI models by dynamically choosing smaller or larger models based on confidence, significantly reducing energy consumption while maintaining accuracy.

Contribution

It introduces a confidence-based hybrid serving approach and a dataflow planner to optimize energy efficiency and throughput in large DNN serving systems.

Findings

01. Reduces energy footprint by up to 19.8x compared to state-of-the-art systems.

02. Maintains accuracy comparable to using only giant DNNs.

03. Improves system throughput through optimized model partitioning.

Abstract

Giant Deep Neural Networks (DNNs) have become indispensable for accurate and robust support of large-scale, cloud-based AI services. However, serving giant DNNs is prohibitively expensive in terms of energy consumption, easily exceeding that of training, because of the enormous scale of the GPU clusters needed to hold giant DNN model partitions and replicas. Existing approaches can optimize either energy efficiency or inference accuracy, but not both. To overcome this status quo, we propose HybridServe, a novel hybrid DNN model serving system that serves multiple versions of the model, sized from small to giant, in tandem. Through a confidence-based hybrid model serving dataflow, HybridServe prefers to serve inference requests with energy-efficient smaller models so long as accuracy is not compromised, thereby reducing the number of replicas needed for giant DNNs. HybridServe…
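The confidence-based dataflow described in the abstract can be illustrated with a minimal cascade sketch. The abstract does not say how confidence is measured or how the escalation decision is made, so everything below is an assumption for illustration: confidence is taken to be the maximum softmax probability, the cutoff is a fixed hypothetical `threshold`, and `cascade_predict`, `small_model`, and `giant_model` are made-up names, not part of HybridServe.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical model interface: a model maps a request to raw class logits.
Model = Callable[[str], Sequence[float]]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a sequence of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cascade_predict(
    request: str, models: Sequence[Model], threshold: float = 0.9
) -> Tuple[int, int]:
    """Try models from smallest to largest; stop at the first confident one.

    Returns (predicted_class, index_of_model_that_answered).
    Assumption: confidence = max softmax probability, compared to a fixed
    threshold. The last (largest) model always answers as a fallback.
    """
    for i, model in enumerate(models):
        probs = softmax(model(request))
        confidence = max(probs)
        is_last = i == len(models) - 1
        if confidence >= threshold or is_last:
            return probs.index(confidence), i
    raise ValueError("models must be non-empty")


# Toy stand-ins for a small and a giant model (illustrative only).
def small_model(request: str) -> Sequence[float]:
    # Confident on "easy" inputs, nearly undecided otherwise.
    return [4.0, 0.0] if request == "easy" else [0.1, 0.0]


def giant_model(request: str) -> Sequence[float]:
    return [0.0, 5.0]


easy_result = cascade_predict("easy", [small_model, giant_model])
hard_result = cascade_predict("hard", [small_model, giant_model])
```

In this toy setup the "easy" request is answered by the small model, while the "hard" request falls below the threshold and is escalated to the giant model. A real system would also need to place and replicate the models across GPUs, which this sketch does not attempt.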

Taxonomy

Topics: Advanced Neural Network Applications · Cloud Computing and Resource Management · Big Data and Digital Economy