SLO-Guard: Crash-Aware, Budget-Consistent Autotuning for SLO-Constrained LLM Serving
Christian Lysenstøen

TL;DR
SLO-Guard is a crash-aware autotuning system for large language model serving that improves budget consistency and latency stability under SLO constraints by treating crashes as first-class observations.
Contribution
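SLO-Guard pairs a feasible-first Thermal Budget Annealing (TBA) exploration phase with a warm-started Tree-structured Parzen Estimator (TPE) exploitation phase, and adds a configuration-repair pass, a GPU-aware KV-cache memory guard, and a four-category crash taxonomy, for more predictable tuning under failures.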
Findings
SLO-Guard achieves higher budget consistency and latency stability than random search.
Both methods attain 75/75 feasibility with zero crashes in the study.
SLO-Guard's cross-seed latency variation is 4.4x tighter than that of random search.
Abstract
Serving large language models under latency service-level objectives (SLOs) is a configuration-heavy systems problem with an unusually failure-prone search space: many plausible configurations crash outright or miss user-visible latency targets, and standard black-box optimizers treat these failures as wasted trials. We present SLO-Guard, a crash-aware autotuner for vLLM serving that treats crashes as first-class observations. SLO-Guard combines a feasible-first Thermal Budget Annealing (TBA) exploration phase with a warm-started Tree-structured Parzen Estimator (TPE) exploitation phase; the handoff replays all exploration history, including crashes encoded as extreme constraint violations. We additionally contribute a configuration-repair pass, a GPU-aware KV-cache memory guard, and a four-category crash taxonomy. We evaluate SLO-Guard on Qwen2-1.5B served with vLLM 0.19 on an NVIDIA…
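The handoff described in the abstract, where exploration history is replayed with crashes encoded as extreme constraint violations, can be illustrated with a minimal sketch. This is not the paper's implementation: the `Trial` record, the `CRASH_VIOLATION` constant, the use of latency as the objective, and the feasible-first sort key are all assumptions made for illustration, and the configuration values are made up.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical penalty value; the abstract only says crashes are encoded
# as "extreme constraint violations" and does not give a number.
CRASH_VIOLATION = 1e6


@dataclass(frozen=True)
class Trial:
    """One exploration trial (hypothetical record layout)."""
    config: dict
    latency_ms: Optional[float]  # None when the server crashed
    crashed: bool = False


def constraint_violation(trial: Trial, slo_ms: float) -> float:
    """Return <= 0 when the SLO is met, > 0 when it is missed.

    Crashed trials get the extreme violation instead of being discarded,
    so they remain visible to whatever optimizer consumes the history.
    """
    if trial.crashed or trial.latency_ms is None:
        return CRASH_VIOLATION
    return trial.latency_ms - slo_ms


def replay_history(history, slo_ms):
    """Convert exploration trials into (config, objective, violation) records.

    Assumption: the objective is latency, to be minimized. A crash has no
    measured latency, so its objective is set to infinity.
    """
    records = []
    for trial in history:
        violation = constraint_violation(trial, slo_ms)
        if trial.crashed or trial.latency_ms is None:
            objective = float("inf")
        else:
            objective = trial.latency_ms
        records.append((trial.config, objective, violation))
    return records


def feasible_first_key(record):
    """Sort key: feasible records by objective, then infeasible by violation."""
    _, objective, violation = record
    if violation <= 0:
        return (0, objective)
    return (1, violation)


history = [
    Trial({"max_num_seqs": 64}, 180.0),
    Trial({"max_num_seqs": 512}, None, crashed=True),
    Trial({"max_num_seqs": 128}, 260.0),
]
records = replay_history(history, slo_ms=200.0)
ranked = sorted(records, key=feasible_first_key)
for config, objective, violation in ranked:
    print(config, objective, violation)
```

In this sketch the crashed configuration sorts last but stays in the record list, which is the point of treating a crash as an observation rather than a wasted trial. A real warm start would pass these records to the optimizer's own history interface, which is not shown here.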
