Automated Dynamic AI Inference Scaling on HPC-Infrastructure: Integrating Kubernetes, Slurm and vLLM

Tim Trappen; Robert Keßler; Roland Pabel; Viktor Achter; Stefan Wesner

arXiv:2511.21413·cs.DC·November 27, 2025

Open Access

TL;DR

This paper presents an architecture that integrates vLLM, Slurm, and Kubernetes on the RAMSES supercomputer to serve large language models for dynamic, user-facing AI workloads. An initial benchmark scales to 100, 500, and 1000 concurrent requests with approximately 500 ms of added end-to-end latency.

Contribution

The paper integrates HPC and cloud technologies (Slurm, Kubernetes, and vLLM) for dynamic AI inference, addressing the poor fit of the classical HPC operating model for synchronous, user-facing workloads.

Findings

01

Efficient scaling for 100, 500, and 1000 concurrent requests.

02

End-to-end latency overhead of approximately 500 ms.

03

Effective utilization of HPC infrastructure for AI inference.

Abstract

Due to rising demand for Artificial Intelligence (AI) inference, especially in higher education, novel solutions utilising existing infrastructure are emerging. The utilisation of High-Performance Computing (HPC) has become a prevalent approach for implementing such solutions. However, the classical operating model of HPC does not adapt well to the requirements of synchronous, user-facing, dynamic AI application workloads. In this paper, we propose a solution that serves LLMs by integrating vLLM, Slurm and Kubernetes on the supercomputer RAMSES. The initial benchmark indicates that the proposed architecture scales efficiently for 100, 500 and 1000 concurrent requests, incurring an overhead of only approximately 500 ms in end-to-end latency.

Taxonomy

Topics: Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques · Scientific Computing and Data Management