WVA: A Global Optimization Control Plane for llmd
Abhishek Malvankar, Lionel Villard, Mohammed Abdi, Tommaso Sgreccia, Evgeny Shindin, Braulio Dumba, Vishakha Ramani, Asser Tantawi, Tamar Eilam

TL;DR
WVA is a control plane for large language model inference that improves resource utilization and reduces request failures by accounting for internal engine state and hardware heterogeneity, outperforming traditional autoscalers.
Contribution
Introduces the Workload Variant Autoscaler (WVA), which couples scaling decisions with internal server state and hardware heterogeneity to manage LLM inference efficiently.
Findings
37% increase in effective throughput
10x reduction in request failures
Reduced power consumption through hardware tiering
Abstract
As Large Language Models (LLMs) scale to handle massive concurrent traffic, optimizing the infrastructure required for inference has become a primary challenge. To manage the high cost of GPU resources while meeting strict service-level objectives (SLOs), operators increasingly deploy models across heterogeneous hardware clusters that multiplex latency-sensitive online requests and throughput-oriented offline requests. However, traditional resource-centric autoscalers such as the Kubernetes Horizontal Pod Autoscaler (HPA) do not take a global view of application-specific SLOs, hardware heterogeneity, or internal engine state (such as KV cache utilization). This leads to unnecessary scaling, severe resource underutilization, and disrupted stateful inference. To address these limitations, we introduce the Workload Variant Autoscaler (WVA), a specialized control plane co-designed with llmd…
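The gap the abstract describes, between a resource-centric scaling signal and internal engine state, can be illustrated with a small sketch. This is not WVA's algorithm, which the truncated abstract does not specify. `hpa_desired` follows the replica formula documented for the Kubernetes HPA, while `ReplicaState`, `state_aware_desired`, and the 0.8 KV cache target are hypothetical names and values chosen only for illustration.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class ReplicaState:
    """Hypothetical per-replica snapshot; field names are illustrative only."""
    gpu_util: float       # fraction of GPU compute in use, 0..1
    kv_cache_util: float  # fraction of KV cache in use, 0..1


def hpa_desired(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """Documented HPA rule: ceil(currentReplicas * currentMetric / targetMetric)."""
    return max(1, math.ceil(current_replicas * current_metric / target_metric))


def state_aware_desired(replicas: List[ReplicaState], kv_target: float = 0.8) -> int:
    """Toy alternative (an assumption, not WVA): apply the same proportional
    rule to mean KV cache utilization instead of a resource metric."""
    n = len(replicas)
    mean_kv = sum(r.kv_cache_util for r in replicas) / n
    return max(1, math.ceil(n * mean_kv / kv_target))


# Case 1: GPU looks moderately busy, but the KV cache is nearly full.
pressured = [ReplicaState(gpu_util=0.5, kv_cache_util=0.95) for _ in range(2)]
resource_view = hpa_desired(len(pressured), current_metric=0.5, target_metric=0.7)
engine_view = state_aware_desired(pressured)

# Case 2: GPU looks busy, but the KV cache is mostly empty.
idle_cache = [ReplicaState(gpu_util=0.9, kv_cache_util=0.2) for _ in range(2)]
resource_view_2 = hpa_desired(len(idle_cache), current_metric=0.9, target_metric=0.7)
engine_view_2 = state_aware_desired(idle_cache)
```

In the first case the resource metric keeps the deployment at two replicas while the cache-based signal asks for a third. In the second case the resource metric scales up while the cache-based signal sees no pressure, a simplified picture of the unnecessary scaling the abstract mentions.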
Taxonomy
Topics: Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques · Software System Performance and Reliability
