KIS-S: A GPU-Aware Kubernetes Inference Simulator with RL-Based Auto-Scaling

Guilin Zhang, Wulan Guo, Ziqi Tan, Qiang Guan, and Hailong Jiang

arXiv:2507.07932·cs.DC·July 11, 2025

Open Access

TL;DR

This paper introduces KIS-S, which pairs a GPU-aware Kubernetes inference simulator (KISim) with a PPO-based autoscaler (KIScaler). KIScaler learns its scaling policies in simulation, improves average reward by 75.2%, and reduces P95 latency by up to 6.7x over CPU baselines.

Contribution

The paper presents KIS-S, a framework that integrates GPU-aware simulation with RL-based autoscaling, so that scaling policies learned in simulation can be deployed directly and generalize across traffic patterns without retraining.

Findings

01 75.2% improvement in average reward

02 Up to 6.7x reduction in P95 latency over CPU baselines

03 Generalizes across four traffic patterns without retraining

Abstract

Autoscaling GPU inference workloads in Kubernetes remains challenging because default mechanisms such as the Horizontal Pod Autoscaler (HPA) are reactive and threshold-based: they struggle under dynamic, bursty traffic and lack integration with GPU-level metrics. We present KIS-S, a unified framework that combines KISim, a GPU-aware Kubernetes Inference Simulator, with KIScaler, a Proximal Policy Optimization (PPO)-based autoscaler. KIScaler learns latency-aware and resource-efficient scaling policies entirely in simulation and is deployed directly without retraining. Experiments across four traffic patterns show that KIScaler improves average reward by 75.2%, reduces P95 latency by up to 6.7x over CPU baselines, and generalizes without retraining. Our work bridges the gap between reactive autoscaling and intelligent orchestration for scalable GPU-accelerated…
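To make the "reactive and threshold-based" behaviour of the HPA concrete, the following is a minimal sketch of the replica calculation described in the Kubernetes documentation: the desired replica count is the current count scaled by the ratio of the observed metric to its target, rounded up. This is not code from the paper. The function name `hpa_desired_replicas`, the replica bounds, and the sample metric values are assumptions chosen for illustration; the 0.1 tolerance mirrors the documented HPA default.

```python
import math


def hpa_desired_replicas(current_replicas, current_metric, target_metric,
                         min_replicas=1, max_replicas=10, tolerance=0.1):
    """Illustrative HPA-style rule: scale replicas by observed/target ratio.

    The name and the default bounds are hypothetical; only the ratio-and-ceil
    formula and the tolerance band follow the Kubernetes HPA documentation.
    """
    ratio = current_metric / target_metric
    # Within the tolerance band the HPA leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))


# Assumed sample values: 2 replicas at 90% utilization against a 60% target.
print(hpa_desired_replicas(2, 90, 60))
```

The rule reacts only after the observed metric has already crossed the target, and it sees only the single metric it is given. That is the limitation the abstract attributes to default autoscaling under bursty traffic.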

Taxonomy

Topics: Software-Defined Networks and 5G · Parallel Computing and Optimization Techniques · Advanced Neural Network Applications