A System for Microserving of LLMs

Hongyi Jin; Ruihang Lai; Charlie F. Ruan; Yingcheng Wang; Todd C. Mowry; Xupeng Miao; Zhihao Jia; Tianqi Chen

arXiv:2412.12488·cs.DC·December 18, 2024

TL;DR

This paper introduces LLM microserving, a flexible multi-level architecture with APIs for fine-grained control, enabling dynamic reconfiguration and improved efficiency in large language model inference serving.

Contribution

The paper presents a microserving architecture whose APIs expose fine-grained, sub-request-level actions, together with a programmable router that coordinates those actions dynamically, making LLM inference systems more flexible without sacrificing performance.

Findings

01 Supports multiple disaggregation strategies with minimal code

02 Reduces job completion time by up to 47%

03 Maintains state-of-the-art inference performance

Abstract

Recent advances in LLMs bring a strong demand for efficient system support to improve overall serving efficiency. As LLM inference scales towards multiple GPUs and even multiple compute nodes, various coordination patterns, such as prefill-decode disaggregation and context migration, arise in serving systems. Most inference services today expose a coarse-grained request-level API with a pre-configured coordination strategy, limiting the ability to customize and dynamically reconfigure the coordination. In this paper, we propose LLM microserving, a multi-level architecture for structuring and programming LLM inference services. We introduce simple yet effective microserving APIs to support fine-grained sub-request level actions. A programmable router transforms user requests into sub-request calls, enabling the dynamic reconfiguration of serving patterns. To support diverse…
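The router idea in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the `Engine` class, its `prefill`, `send_kv`, and `decode` methods, and the two `route_*` functions are hypothetical names, not the paper's API, and no real model is run. The sketch only shows how one user request might be turned into a sequence of sub-request calls, and how changing the router function alone switches between a colocated pattern and prefill-decode disaggregation.

```python
from dataclasses import dataclass, field


@dataclass
class Engine:
    """Toy stand-in for one inference engine (hypothetical, not the paper's API)."""
    name: str
    kv_cache: dict = field(default_factory=dict)  # request id -> tokens held as "KV"
    log: list = field(default_factory=list)       # sub-request calls this engine received

    def prefill(self, req_id, tokens):
        # Sub-request action: build KV entries for the prompt tokens.
        self.kv_cache[req_id] = list(tokens)
        self.log.append(("prefill", req_id, len(tokens)))

    def send_kv(self, req_id, target):
        # Sub-request action: move this request's KV entries to another engine.
        target.kv_cache[req_id] = self.kv_cache.pop(req_id)
        self.log.append(("send_kv", req_id, target.name))

    def decode(self, req_id, max_new_tokens):
        # Sub-request action: generate from KV entries already on this engine.
        context = self.kv_cache[req_id]
        self.log.append(("decode", req_id, max_new_tokens))
        # Placeholder generation: emit numbered dummy tokens instead of running a model.
        start = len(context)
        return [f"<tok{i}>" for i in range(start, start + max_new_tokens)]


def route_colocated(engine, req_id, tokens, max_new_tokens):
    """Router pattern 1: one engine handles both prefill and decode."""
    engine.prefill(req_id, tokens)
    return engine.decode(req_id, max_new_tokens)


def route_disaggregated(prefill_engine, decode_engine, req_id, tokens, max_new_tokens):
    """Router pattern 2: prefill on one engine, transfer KV, decode on another."""
    prefill_engine.prefill(req_id, tokens)
    prefill_engine.send_kv(req_id, decode_engine)
    return decode_engine.decode(req_id, max_new_tokens)


if __name__ == "__main__":
    a, b = Engine("prefill-0"), Engine("decode-0")
    print(route_disaggregated(a, b, "req-1", ["Hello", "world"], 3))
    print(a.log)
    print(b.log)
```

In this sketch the engines never change; only the router function decides which sub-request calls are issued and where. That separation is the property the abstract attributes to a programmable router, though the real system's interfaces and data movement are necessarily more involved.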
