Towards Resource-Efficient Serverless LLM Inference with SLINFER

Chuhao Xu; Zijun Li; Quan Chen; Han Zhao; Xueyan Tang; Minyi Guo

arXiv:2507.00507 · cs.DC · December 16, 2025

Open Access

TL;DR

SLINFER is a resource-efficient serverless inference scheme for small- to mid-sized LLMs that optimizes hardware utilization and sharing, significantly increasing serving capacity on heterogeneous CPU and GPU platforms.

Contribution

The paper introduces a resource management approach for serverless LLM inference that enables elastic sharing across CPUs and GPUs, with fine-grained compute allocation and memory management.

Findings

01 SLINFER improves serving capacity by 47% to 62% through sharing.

02 Additionally leveraging CPUs raises the capacity improvement to 86% to 154%.

03 Experimental results demonstrate significant efficiency gains on heterogeneous hardware.

Abstract

The rise of LLMs has driven demand for private serverless deployments, characterized by moderate-sized models and infrequent requests. While existing serverless solutions follow exclusive GPU allocation, we take a step back to explore modern platforms and find that emerging CPU architectures with built-in accelerators are capable of serving LLMs but remain underutilized, and that both CPUs and GPUs can accommodate multiple LLMs simultaneously. We propose SLINFER, a resource-efficient serverless inference scheme tailored for small- to mid-sized LLMs that enables elastic and on-demand sharing across heterogeneous hardware. SLINFER tackles three fundamental challenges: (1) precise, fine-grained compute resource allocation at the token level to handle fluctuating computational demands; (2) a coordinated and forward-looking memory scaling mechanism to detect out-of-memory hazards and reduce…
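To make the idea of a forward-looking out-of-memory check concrete, here is a minimal Python sketch. It is not SLINFER's mechanism, which this page does not describe in detail. It only illustrates one way a scheduler could project the memory footprint of several co-located models before admitting more work. The `Instance` fields, the function names, and the linear KV-cache growth model are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """One co-located LLM instance (hypothetical fields, not from the paper)."""
    name: str
    weight_bytes: int        # memory held by the model weights
    kv_bytes_per_token: int  # assumed linear KV-cache growth per token
    cached_tokens: int       # tokens currently held in the KV cache
    pending_tokens: int      # tokens still expected from in-flight requests


def projected_bytes(inst: Instance) -> int:
    """Memory the instance would need once its in-flight requests finish."""
    total_tokens = inst.cached_tokens + inst.pending_tokens
    return inst.weight_bytes + inst.kv_bytes_per_token * total_tokens


def oom_hazard(instances: list[Instance], capacity_bytes: int) -> bool:
    """True if the projected footprint of all instances exceeds device memory."""
    return sum(projected_bytes(i) for i in instances) > capacity_bytes


if __name__ == "__main__":
    GIB = 1024 ** 3
    # Illustrative sizes only; they are not measurements from the paper.
    shared = [
        Instance("model-a", 14 * GIB, 512 * 1024, 2_000, 6_000),
        Instance("model-b", 6 * GIB, 256 * 1024, 1_000, 3_000),
    ]
    print("hazard on 24 GiB device:", oom_hazard(shared, 24 * GIB))
    print("hazard on 32 GiB device:", oom_hazard(shared, 32 * GIB))
```

The check looks at projected rather than current usage, so a hazard can be flagged before the KV cache actually fills. A real system would need a better estimate of pending tokens than the fixed numbers assumed here.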

Taxonomy

Topics: Cloud Computing and Resource Management · Advanced Data Storage Technologies · Security and Verification in Computing