A Survey of LLM Inference Systems
James Pan, Guoliang Li

TL;DR
This survey reviews recent techniques and system designs for large language model inference, analyzing their components, optimizations, and deployment strategies to improve performance and resource management.
Contribution
It provides a comprehensive analysis of LLM inference system techniques, comparing existing systems and discussing their integration and challenges.
Findings
The surveyed techniques rely on load prediction and adaptive mechanisms.
Optimization strategies include batching, scheduling, and memory management.
Systems can be single-replica or multi-replica, disaggregated, or serverless.
Abstract
The past few years have witnessed the rise of specialized large language model (LLM) inference systems, such as vLLM, SGLang, Mooncake, and DeepFlow, alongside rapid LLM adoption via services like ChatGPT. Driving these system design efforts is the unique autoregressive nature of LLM request processing, motivating new techniques for achieving high performance while preserving high inference quality over high-volume and high-velocity workloads. While many of these techniques are discussed across the literature, they have not been analyzed under the framework of a complete inference system, nor have the systems themselves been analyzed and compared. In this survey, we review these techniques, starting from operators and algorithms for request processing, then moving on to techniques for model optimization and execution, including kernel design, batching, and scheduling, before ending with techniques…
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Big Data and Digital Economy · Software System Performance and Reliability
