Towards Resource-Efficient Serverless LLM Inference with SLINFER
Chuhao Xu, Zijun Li, Quan Chen, Han Zhao, Xueyan Tang, Minyi Guo

TL;DR
SLINFER is a resource-efficient serverless inference scheme for small- to mid-sized LLMs that optimizes hardware utilization and sharing, significantly increasing serving capacity on heterogeneous CPU and GPU platforms.
Contribution
It introduces a novel resource management approach for serverless LLM inference, enabling elastic sharing across CPUs and GPUs with fine-grained allocation and memory management.
Findings
SLINFER improves serving capacity by 47% to 62% through hardware sharing.
Additionally serving models on CPUs boosts capacity by 86% to 154%.
Experimental results demonstrate significant efficiency gains on heterogeneous hardware.
Abstract
The rise of LLMs has driven demand for private serverless deployments, characterized by moderate-sized models and infrequent requests. While existing serverless solutions follow exclusive GPU allocation, we take a step back to explore modern platforms and find that emerging CPU architectures with built-in accelerators are capable of serving LLMs but remain underutilized, and that both CPUs and GPUs can accommodate multiple LLMs simultaneously. We propose SLINFER, a resource-efficient serverless inference scheme tailored for small- to mid-sized LLMs that enables elastic and on-demand sharing across heterogeneous hardware. SLINFER tackles three fundamental challenges: (1) precise, fine-grained compute resource allocation at the token level to handle fluctuating computational demands; (2) a coordinated and forward-looking memory scaling mechanism to detect out-of-memory hazards and reduce…
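As a rough illustration of what a forward-looking out-of-memory check could look like when several models share one device, the sketch below projects each instance's memory use over a short look-ahead window and compares the total against a headroom threshold. This is not SLINFER's algorithm. The Instance fields, the projected_mib and oom_hazard helpers, the 0.9 headroom factor, and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    """Hypothetical description of one co-located model instance."""
    name: str
    weights_mib: float        # resident model weights
    kv_mib_per_token: float   # KV-cache growth per token (assumed constant)
    cached_tokens: int        # tokens currently held in the KV cache
    pending_tokens: int       # tokens expected within the look-ahead window


def projected_mib(inst: Instance) -> float:
    """Memory the instance is expected to need by the end of the window."""
    total_tokens = inst.cached_tokens + inst.pending_tokens
    return inst.weights_mib + inst.kv_mib_per_token * total_tokens


def oom_hazard(instances: List[Instance], capacity_mib: float,
               headroom: float = 0.9) -> bool:
    """True if projected use exceeds the headroom fraction of capacity."""
    total = sum(projected_mib(i) for i in instances)
    return total > headroom * capacity_mib


# Two illustrative instances sharing one device (made-up numbers).
shared = [
    Instance("model-a", 14336, 0.5, 4000, 2000),
    Instance("model-b", 6144, 0.25, 8000, 4000),
]
print(oom_hazard(shared, capacity_mib=32768))  # fits within headroom
print(oom_hazard(shared, capacity_mib=24576))  # projected to overflow
```

A scheduler built on this idea could react to a positive result before the overflow occurs, for example by delaying admission or relocating an instance; which action the paper takes is not stated in the truncated abstract.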
Taxonomy
Topics: Cloud Computing and Resource Management · Advanced Data Storage Technologies · Security and Verification in Computing
