A System for Microserving of LLMs
Hongyi Jin, Ruihang Lai, Charlie F. Ruan, Yingcheng Wang, Todd C. Mowry, Xupeng Miao, Zhihao Jia, Tianqi Chen

TL;DR
This paper introduces LLM microserving, a flexible multi-level architecture with APIs for fine-grained control, enabling dynamic reconfiguration and improved efficiency in large language model inference serving.
Contribution
It presents a novel microserving architecture with APIs and a programmable router for dynamic coordination, enhancing flexibility and performance in LLM inference systems.
Findings
Supports multiple disaggregation strategies with minimal code
Reduces job completion time by up to 47%
Maintains state-of-the-art inference performance
Abstract
The recent advances in LLMs bring a strong demand for efficient system support to improve overall serving efficiency. As LLM inference scales towards multiple GPUs and even multiple compute nodes, various coordination patterns, such as prefill-decode disaggregation and context migration, arise in serving systems. Most inference services today expose a coarse-grained request-level API with a pre-configured coordination strategy, limiting the ability to customize and dynamically reconfigure the coordination. In this paper, we propose LLM microserving, a multi-level architecture for structuring and programming LLM inference services. We introduce simple yet effective microserving APIs to support fine-grained sub-request level actions. A programmable router transforms user requests into sub-request calls, enabling the dynamic reconfiguration of serving patterns. To support diverse…
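To make the router idea concrete, here is a minimal sketch of how a programmable router might turn one user request into sub-request calls. It is an illustration only: the `Engine` class, its `prefill`, `send_context`, and `decode` methods, and both router functions are hypothetical names invented for this example, not the paper's actual microserving API, and the "model" just emits placeholder tokens.

```python
from dataclasses import dataclass, field


@dataclass
class Engine:
    """Toy stand-in for one inference engine (hypothetical, not the paper's API)."""
    name: str
    kv_cache: dict = field(default_factory=dict)  # request id -> cached context tokens

    def prefill(self, req_id, tokens):
        # Sub-request action: build and store context state for the prompt.
        self.kv_cache[req_id] = list(tokens)

    def send_context(self, req_id, target):
        # Sub-request action: migrate the cached context to another engine.
        target.kv_cache[req_id] = self.kv_cache.pop(req_id)

    def decode(self, req_id, max_new_tokens):
        # Sub-request action: generate from the cached context.
        # A real engine would run the model; this emits placeholder tokens.
        context = self.kv_cache[req_id]
        return [f"tok{len(context) + i}" for i in range(max_new_tokens)]


def colocated_router(engines, req_id, prompt, max_new_tokens):
    # Pattern 1: prefill and decode on the same engine.
    engine = engines[0]
    engine.prefill(req_id, prompt)
    return engine.decode(req_id, max_new_tokens)


def disaggregated_router(engines, req_id, prompt, max_new_tokens):
    # Pattern 2: prefill on one engine, move the context, decode on another.
    prefill_engine, decode_engine = engines[0], engines[1]
    prefill_engine.prefill(req_id, prompt)
    prefill_engine.send_context(req_id, decode_engine)
    return decode_engine.decode(req_id, max_new_tokens)


if __name__ == "__main__":
    prompt = ["a", "b", "c"]
    print(colocated_router([Engine("e0")], "r1", prompt, 2))
    print(disaggregated_router([Engine("e0"), Engine("e1")], "r2", prompt, 2))
```

In this sketch, switching the serving pattern means swapping one router function for another while the engines stay unchanged, which is the kind of reconfiguration the abstract attributes to sub-request level control.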
Taxonomy
Topics: Reservoir Engineering and Simulation Methods
