Automated Dynamic AI Inference Scaling on HPC-Infrastructure: Integrating Kubernetes, Slurm and vLLM
Tim Trappen, Robert Keßler, Roland Pabel, Viktor Achter, Stefan Wesner

TL;DR
This paper presents an integrated architecture using vLLM, Slurm, and Kubernetes on HPC infrastructure to efficiently serve large language models with dynamic, user-facing AI workloads, demonstrating scalable performance with minimal latency overhead.
Contribution
It introduces a novel integration of HPC and cloud technologies for dynamic AI inference, addressing the limitations of traditional HPC models for user-facing workloads.
Findings
Efficient scaling for 100, 500, and 1000 concurrent requests.
End-to-end latency overhead of approximately 500 ms.
Effective utilization of HPC infrastructure for AI inference.
Abstract
Due to rising demand for Artificial Intelligence (AI) inference, especially in higher education, novel solutions utilising existing infrastructure are emerging. The utilisation of High-Performance Computing (HPC) has become a prevalent approach for the implementation of such solutions. However, the classical operating model of HPC does not adapt well to the requirements of synchronous, user-facing dynamic AI application workloads. In this paper, we propose a solution that serves large language models (LLMs) by integrating vLLM, Slurm and Kubernetes on the supercomputer RAMSES. An initial benchmark indicates that the proposed architecture scales efficiently for 100, 500 and 1000 concurrent requests, incurring an overhead of only approximately 500 ms in end-to-end latency.
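The abstract does not describe how the benchmark was run, so the following is only a minimal sketch of how an end-to-end latency overhead under concurrent load could be measured. It assumes the overhead is the difference between the median latency through the full serving path and the median latency of calling the inference engine directly. The functions `fake_direct` and `fake_gateway` are hypothetical stand-ins that only sleep, and the 10 ms and 5 ms delays are made-up values, not results from the paper.

```python
import asyncio
import statistics
import time


async def timed_call(send_request, request_id):
    """Return the wall-clock latency in seconds of one awaited request."""
    start = time.perf_counter()
    await send_request(request_id)
    return time.perf_counter() - start


async def measure_latencies(send_request, concurrency):
    """Issue `concurrency` requests at once and collect their latencies."""
    tasks = [timed_call(send_request, i) for i in range(concurrency)]
    return await asyncio.gather(*tasks)


def latency_overhead(end_to_end, direct):
    """Median end-to-end latency minus median direct latency (assumed metric)."""
    return statistics.median(end_to_end) - statistics.median(direct)


async def fake_direct(request_id):
    # Hypothetical stand-in for a direct call to the inference engine.
    await asyncio.sleep(0.01)


async def fake_gateway(request_id):
    # Hypothetical stand-in for the full path; adds a made-up 5 ms delay.
    await asyncio.sleep(0.005)
    await fake_direct(request_id)


def run_benchmark(concurrency):
    """Measure both paths at the given concurrency and return the overhead."""
    direct = asyncio.run(measure_latencies(fake_direct, concurrency))
    end_to_end = asyncio.run(measure_latencies(fake_gateway, concurrency))
    return latency_overhead(end_to_end, direct)


if __name__ == "__main__":
    for concurrency in (100, 500, 1000):
        overhead_ms = run_benchmark(concurrency) * 1000
        print(f"{concurrency} concurrent requests: overhead {overhead_ms:.1f} ms")
```

In a real measurement, the two stand-ins would be replaced by HTTP calls to the deployed endpoint and to the backend engine. The median is used here because it is less sensitive to a few slow outliers than the mean; other percentiles may be more appropriate depending on the service-level target.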
Taxonomy
Topics: Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques · Scientific Computing and Data Management
