Adaptive Request Scheduling for CodeLLM Serving with SLA Guarantees
Shi Chang, Boyuan Chen, Kishanthan Thangarajah, Hanan Lutfiyya, Ahmed E. Hassan

TL;DR
This paper introduces SABER, an adaptive batching strategy for CodeLLM serving that predicts SLA feasibility and dynamically adjusts request batching, significantly improving throughput and reducing latency variability in resource-constrained environments.
Contribution
SABER is the first dynamic, SLA-aware batching method for CodeLLM serving that adapts in real time without manual tuning, outperforming the best static configurations.
Findings
Improves goodput by up to 26% over static batching.
Reduces latency variability by up to 45%.
Eliminates the need for manual tuning or service restarts.
Abstract
Code Large Language Models (CodeLLMs) are increasingly integrated into modern software development workflows, yet efficiently serving them in resource-constrained, self-hosted environments remains a significant challenge. Existing LLM serving systems employ continuous batching to improve throughput. However, they rely on static batch size configurations that cannot adapt to fluctuating request rates or heterogeneous workloads, leading to frequent SLA (Service Level Agreement) violations and unstable performance. In this study, we propose SABER, a dynamic batching strategy that predicts per-request SLA feasibility and adjusts batching decisions in real time. SABER improves goodput by up to 26% over the best static configurations and reduces latency variability by up to 45%, all without manual tuning or service restarts. Our results demonstrate that SLA-aware, adaptive scheduling is key to…
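To make the idea of SLA-feasibility-driven batching concrete, here is a minimal Python sketch. It is not SABER's algorithm, which the abstract does not specify. It assumes a hypothetical linear latency model (`per_token_latency`, with made-up constants) and a greedy admission rule: a waiting request joins the batch only if every request in the enlarged batch is still predicted to meet its deadline. All names (`Request`, `is_feasible`, `admit`) are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    tokens_remaining: int  # estimated output tokens still to generate
    deadline_s: float      # seconds left before the SLA would be violated


def per_token_latency(batch_size, base_s=0.02, slope_s=0.005):
    """Hypothetical model: per-token decode time grows linearly with batch size."""
    return base_s + slope_s * batch_size


def is_feasible(req, batch_size):
    """Predict whether req can finish before its deadline at this batch size."""
    return req.tokens_remaining * per_token_latency(batch_size) <= req.deadline_s


def admit(running, waiting):
    """Greedily admit waiting requests, tightest deadline first, as long as
    every request in the resulting batch is still predicted to meet its SLA."""
    batch = list(running)
    deferred = []
    for req in sorted(waiting, key=lambda r: r.deadline_s):
        candidate = batch + [req]
        if all(is_feasible(r, len(candidate)) for r in candidate):
            batch = candidate
        else:
            deferred.append(req)
    return batch, deferred


if __name__ == "__main__":
    running = [Request("a", 100, 4.0)]
    waiting = [Request("b", 100, 3.2), Request("c", 100, 10.0)]
    batch, deferred = admit(running, waiting)
    print([r.req_id for r in batch], [r.req_id for r in deferred])
```

In this toy run, admitting a third request would push the per-token latency high enough that request `b` would miss its deadline, so `c` is deferred. A static batch size cannot make this kind of per-request trade-off, which is the limitation the abstract points to.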
