SageServe: Optimizing LLM Serving on Cloud Data Centers with Forecast Aware Auto-Scaling
Shashwat Jaiswal, Kunal Jain, Yogesh Simmhan, Anjaly Parayil, Ankur Mallick, Rujia Wang, Renee St. Amant, Chetan Bansal, Victor Rühle, Anoop Kulkarni, Steve Kofsky, Saravan Rajmohan

TL;DR
SageServe is a dynamic LLM serving framework that optimizes cloud resource utilization and reduces costs by forecasting workloads and adaptively scaling GPU resources while maintaining SLA compliance.
Contribution
It introduces a workload characterization study and a comprehensive auto-scaling framework combining traffic forecasting with ILP-based resource allocation for LLM serving.
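To make the forecast-then-allocate idea concrete, here is a minimal sketch, not SageServe's actual method. It assumes a moving-average forecaster and a toy integer program solved by exhaustive search, where a production system would use a real forecasting model and an ILP solver. The GPU type names, capacities, costs, and the functions `forecast_next` and `allocate` are hypothetical and chosen only for illustration.

```python
from itertools import product


def forecast_next(history, window=3):
    """Moving-average forecast of next-interval load (a stand-in forecaster)."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def allocate(demand, gpu_types, max_per_type=10):
    """Solve a tiny integer program by exhaustive search.

    minimize   sum(cost_i * n_i)
    subject to sum(capacity_i * n_i) >= demand, with integer 0 <= n_i <= max_per_type

    Returns (total_cost, counts), or None if no feasible plan exists.
    """
    best = None
    for counts in product(range(max_per_type + 1), repeat=len(gpu_types)):
        capacity = sum(n * g["capacity"] for n, g in zip(counts, gpu_types))
        if capacity < demand:
            continue  # plan cannot serve the forecast load
        cost = sum(n * g["cost"] for n, g in zip(counts, gpu_types))
        if best is None or cost < best[0]:
            best = (cost, counts)
    return best


# Hypothetical instance types: capacity in requests/sec, cost in arbitrary units.
GPU_TYPES = [
    {"name": "gpu_a", "capacity": 40, "cost": 3},
    {"name": "gpu_b", "capacity": 100, "cost": 7},
]

history = [180, 210, 240]          # observed requests/sec in recent intervals
demand = forecast_next(history)    # forecast load for the next interval
plan = allocate(demand, GPU_TYPES)
print(demand, plan)
```

With these illustrative numbers, the forecast is 210 requests/sec and the cheapest feasible plan mixes three `gpu_a` instances with one `gpu_b` instance. Exhaustive search only works at this toy scale, which is why larger allocation problems of this shape are usually handed to an ILP solver.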
Findings
Up to 25% GPU-hour savings compared to baseline
80% reduction in GPU-hour wastage
Potential monthly cost savings of $2.5 million
Abstract
Global cloud service providers handle inference workloads for Large Language Models (LLMs) that span latency-sensitive (e.g., chatbots) and insensitive (e.g., report writing) tasks, resulting in diverse and often conflicting Service Level Agreement (SLA) requirements. Managing such mixed workloads is challenging due to the complexity of the inference serving stack, which encompasses multiple models, GPU hardware, and global data centers. Existing solutions often silo such fast and slow tasks onto separate GPU resource pools with different SLAs, but this leads to significant under-utilization of expensive accelerators due to load mismatch. In this article, we characterize the LLM serving workloads at Microsoft Office 365, one of the largest users of LLMs within Microsoft Azure cloud with over 10 million requests per day, and highlight key observations across workloads in different data…
Taxonomy
Topics: Natural Language Processing Techniques
