Scalable Engine and the Performance of Different LLM Models in a SLURM based HPC architecture

Anderson de Lima Luiz; Shubham Vijay Kurlekar; Munir Georges

arXiv:2508.17814·cs.DC·August 26, 2025


TL;DR

This paper presents a scalable HPC-based architecture using SLURM for deploying and managing heterogeneous LLMs efficiently, demonstrating high throughput and low latency for small models and identifying saturation points for larger models.

Contribution

It introduces a novel scalable inference engine architecture leveraging SLURM, containerization, and REST APIs for efficient deployment and management of diverse LLMs in HPC environments.
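As a rough illustration of how per-model resource requests might be expressed for SLURM, the sketch below renders a batch script from a small resource profile. It is not the authors' implementation: the function name `render_sbatch`, the resource numbers, and the launch command `./serve_model.sh` are hypothetical placeholders. Only the `#SBATCH` directives shown are standard `sbatch` options.

```python
def render_sbatch(job_name: str, gpus: int, cpus: int, mem_gb: int, launch_cmd: str) -> str:
    """Render a minimal SLURM batch script for one model-serving job.

    All argument values are supplied by the caller; nothing here reflects
    the resource settings used in the paper.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        "#SBATCH --nodes=1",
        f"#SBATCH --gres=gpu:{gpus}",        # GPUs requested per node
        f"#SBATCH --cpus-per-task={cpus}",   # CPU cores for the serving process
        f"#SBATCH --mem={mem_gb}G",          # host memory per node
        "",
        f"srun {launch_cmd}",                # hypothetical container/service launcher
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Illustrative profile only: the numbers are assumptions, not measured requirements.
    print(render_sbatch("llm-small", gpus=1, cpus=4, mem_gb=16, launch_cmd="./serve_model.sh"))
```

Keeping the resource profile separate from the script template is one way to let a scheduler front end submit differently sized jobs for small and large models without duplicating scripts.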

Findings

01

Small models handle up to 128 concurrent requests with sub-50 ms latency.

02

Larger models saturate with as few as two concurrent users, at which point latency exceeds 2 seconds.

03

The architecture scales reliably for batch and interactive inference scenarios.
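The findings above relate concurrency to latency. A minimal sketch of how such a measurement could be set up is shown below, assuming a thread pool that issues requests at a fixed concurrency level. The function `fake_inference` is a hypothetical stand-in that simply sleeps; in a real test it would be replaced by a call to the deployed REST endpoint, whose URL and payload format are not given in this summary. The numbers it produces say nothing about the paper's results.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def fake_inference(prompt: str, service_time_s: float = 0.01) -> str:
    """Hypothetical stand-in for an LLM endpoint call; it only sleeps."""
    time.sleep(service_time_s)
    return f"echo: {prompt}"


def timed_call(prompt: str) -> float:
    """Return the wall-clock latency of one request in seconds."""
    start = time.perf_counter()
    fake_inference(prompt)
    return time.perf_counter() - start


def measure(concurrency: int, requests_per_worker: int = 2) -> dict:
    """Issue requests from `concurrency` worker threads and summarize latency."""
    prompts = [f"prompt {i}" for i in range(concurrency * requests_per_worker)]
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, prompts))
    wall = time.perf_counter() - wall_start
    return {
        "concurrency": concurrency,
        "requests": len(latencies),
        "mean_latency_s": statistics.mean(latencies),
        "max_latency_s": max(latencies),
        "throughput_rps": len(latencies) / wall,
    }


if __name__ == "__main__":
    # Sweep a few concurrency levels, up to the 128 mentioned in the findings.
    for level in (1, 2, 8, 32, 128):
        print(measure(level))
```

With a real backend, a saturation point would show up as the level at which mean latency rises sharply while throughput stops increasing.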

Abstract

This work describes a high-performance computing (HPC) architecture based on the Simple Linux Utility for Resource Management (SLURM) [1] for deploying heterogeneous Large Language Models (LLMs) as a scalable inference engine. It uses dynamic resource scheduling and seamless integration of containerized microservices to manage CPU, GPU, and memory allocations efficiently across multi-node clusters. Extensive experiments with Llama 3.2 (1B and 3B parameters) [2] and Llama 3.1 (8B and 70B) [3] probe throughput, latency, and concurrency. They show that small models can handle up to 128 concurrent requests at sub-50 ms latency, whereas larger models saturate with as few as two concurrent users, with latency exceeding 2 seconds. The architecture includes Representational State Transfer Application Programming Interface (REST API) [4] endpoints for…
