SageServe: Optimizing LLM Serving on Cloud Data Centers with Forecast Aware Auto-Scaling

Shashwat Jaiswal; Kunal Jain; Yogesh Simmhan; Anjaly Parayil; Ankur Mallick; Rujia Wang; Renee St. Amant; Chetan Bansal; Victor Rühle; Anoop Kulkarni; Steve Kofsky; Saravan Rajmohan

arXiv:2502.14617 · cs.DC · November 14, 2025 · 2 cites

Open Access

TL;DR

SageServe is a dynamic LLM serving framework that optimizes cloud resource utilization and reduces costs by forecasting workloads and adaptively scaling GPU resources while maintaining SLA compliance.

Contribution

The paper contributes a workload characterization study and a comprehensive auto-scaling framework that combines traffic forecasting with ILP-based resource allocation for LLM serving.
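To make the idea of pairing a traffic forecast with an integer allocation step concrete, here is a minimal Python sketch. It is not the paper's formulation: the seasonal-naive forecaster, the exhaustive search (standing in for an ILP solver), and every name and number in it (`seasonal_naive_forecast`, `min_cost_allocation`, the per-instance capacities and hourly costs) are hypothetical choices made only for illustration.

```python
from itertools import product


def seasonal_naive_forecast(history, season, horizon):
    """Forecast by repeating the most recent full season of observations.

    A deliberately simple stand-in for a real traffic forecaster.
    """
    if len(history) < season:
        raise ValueError("need at least one full season of history")
    last_season = history[-season:]
    return [last_season[i % season] for i in range(horizon)]


def min_cost_allocation(demand, capacity, cost, max_instances):
    """Pick integer instance counts per (hypothetical) GPU type.

    Minimizes sum(count * cost) subject to sum(count * capacity) >= demand.
    This enumerates every combination, which is only workable for tiny
    problems; a real system would hand the same model to an ILP solver.
    """
    best = None
    for counts in product(range(max_instances + 1), repeat=len(capacity)):
        served = sum(n * c for n, c in zip(counts, capacity))
        if served < demand:
            continue  # violates the demand-coverage constraint
        total = sum(n * p for n, p in zip(counts, cost))
        if best is None or total < best[0]:
            best = (total, counts)
    if best is None:
        raise ValueError("demand cannot be met within max_instances")
    return best[1], best[0]


# Hypothetical request rates observed over two three-step "seasons".
history = [100, 300, 200, 120, 320, 210]
forecast = seasonal_naive_forecast(history, season=3, horizon=2)

# Provision for the forecast peak using two made-up instance types.
counts, hourly_cost = min_cost_allocation(
    demand=max(forecast),
    capacity=[50, 120],  # assumed requests/sec served per instance
    cost=[4, 9],         # assumed cost units per instance-hour
    max_instances=8,
)
print(forecast, counts, hourly_cost)
```

Sizing for the forecast peak rather than the current load is the part that makes the allocation forecast-aware; the choice of forecaster and solver here is interchangeable.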

Findings

01 Up to 25% GPU-hour savings compared to baseline

02 80% reduction in GPU-hour wastage

03 Potential monthly cost savings of $2.5 million

Abstract

Global cloud service providers handle inference workloads for Large Language Models (LLMs) that span latency-sensitive (e.g., chatbots) and insensitive (e.g., report writing) tasks, resulting in diverse and often conflicting Service Level Agreement (SLA) requirements. Managing such mixed workloads is challenging due to the complexity of the inference serving stack, which encompasses multiple models, GPU hardware, and global data centers. Existing solutions often silo such fast and slow tasks onto separate GPU resource pools with different SLAs, but this leads to significant under-utilization of expensive accelerators due to load mismatch. In this article, we characterize the LLM serving workloads at Microsoft Office 365, one of the largest users of LLMs within Microsoft Azure cloud with over 10 million requests per day, and highlight key observations across workloads in different data…

Taxonomy

Topics: Natural Language Processing Techniques