A Survey of LLM Inference Systems

James Pan; Guoliang Li

arXiv:2506.21901·cs.DB·June 30, 2025

TL;DR

This survey reviews recent techniques and system designs for large language model inference, analyzing their components, optimizations, and deployment strategies to improve performance and resource management.

Contribution

It provides a comprehensive analysis of LLM inference system techniques, comparing existing systems and discussing their integration and challenges.

Findings

01. Techniques rely on load prediction and adaptive mechanisms.

02. Optimization strategies include batching, scheduling, and memory management.

03. Systems can be single-replica or multi-replica, disaggregated, or serverless.

Abstract

The past few years have witnessed the emergence of specialized large language model (LLM) inference systems, such as vLLM, SGLang, Mooncake, and DeepFlow, alongside rapid LLM adoption via services like ChatGPT. Driving these system design efforts is the unique autoregressive nature of LLM request processing, motivating new techniques for achieving high performance while preserving high inference quality over high-volume and high-velocity workloads. While many of these techniques are discussed across the literature, they have not been analyzed under the framework of a complete inference system, nor have the systems themselves been analyzed and compared. In this survey, we review these techniques, starting from operators and algorithms for request processing, then moving on to techniques for model optimization and execution, including kernel design, batching, and scheduling, before ending with techniques…

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Big Data and Digital Economy · Software System Performance and Reliability