LLM Inference Serving: Survey of Recent Advances and Opportunities
Baolin Li, Yankai Jiang, Vijay Gadepally, Devesh Tiwari

TL;DR
This survey reviews recent system-level advancements in LLM inference serving since 2023, emphasizing performance and efficiency improvements for real-world deployment without changing core decoding methods.
Contribution
The survey gives a comprehensive overview of recent research, highlighting key innovations and practical considerations for deploying and scaling LLMs in production environments.
Findings
Identifies recent system-level enhancements for LLM serving
Highlights performance and efficiency improvements
Provides practical deployment insights
Abstract
This survey offers a comprehensive overview of recent advancements in Large Language Model (LLM) serving systems, focusing on research since 2023. We specifically examine system-level enhancements that improve performance and efficiency without altering the core LLM decoding mechanisms. By selecting and reviewing high-quality papers from prestigious ML and systems venues, we highlight key innovations and practical considerations for deploying and scaling LLMs in real-world production environments. This survey serves as a valuable resource for LLM practitioners seeking to stay abreast of the latest developments in this rapidly evolving field.
Taxonomy
Topics: Data Quality and Management
