WVA: A Global Optimization Control Plane for llmd

Abhishek Malvankar; Lionel Villard; Mohammed Abdi; Tommaso Sgreccia; Evgeny Shindin; Braulio Dumba; Vishakha Ramani; Asser Tantawi; Tamar Eilam

arXiv:2603.09730 · cs.ET · March 23, 2026

Open Access

TL;DR

WVA is a control plane designed for large language model inference that optimizes resource utilization and reduces failures by considering application-specific states and hardware heterogeneity, outperforming traditional autoscalers.

Contribution

Introduces WVA, a novel workload-aware autoscaler that couples scaling with internal server states and hardware heterogeneity for efficient LLM inference management.

Findings

01. 37% increase in effective throughput

02. 10x reduction in request failures

03. Reduced power consumption through hardware tiering

Abstract

As Large Language Models (LLMs) scale to handle massive concurrent traffic, optimizing the infrastructure required for inference has become a primary challenge. To manage the high cost of GPU resources while ensuring strict service-level objectives (SLOs), operators increasingly deploy models across heterogeneous hardware clusters that multiplex latency-sensitive online requests and throughput-oriented offline requests. However, traditional resource-centric autoscalers like the Kubernetes horizontal pod autoscaler (HPA) do not consider application-specific SLOs, hardware heterogeneity, or internal engine state (like KV cache utilization) globally. This leads to unnecessary scaling, severe resource underutilization, and disrupted stateful inference. To address these limitations, we introduce the Workload Variant Autoscaler (WVA), a specialized control plane co-designed with llmd…

Taxonomy

Topics: Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques · Software System Performance and Reliability