Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads

Cunchen Hu; Heyang Huang; Liangliang Xu; Xusheng Chen; Jiang Xu; Shuang Chen; Hao Feng; Chenxi Wang; Sa Wang; Yungang Bao; Ninghui Sun; Yizhou Shan

arXiv:2401.11181 · cs.DC · January 23, 2024 · 6 cites

TL;DR

TetriInfer disaggregates the prefill and decode phases of LLM inference and schedules them separately, reducing interference between requests while improving efficiency and latency in cloud serving.

Contribution

TetriInfer is an inference serving framework that runs prefill and decode on separate instances and uses a two-level scheduler, augmented with predicted resource usage, to avoid decode scheduling hotspots.

Findings

01 Reduces resource usage by 38%

02 Lowers average TTFT by 97%

03 Decreases average JCT by 47%

Abstract

Transformer-based large language model (LLM) inference serving is now the backbone of many cloud services. LLM inference consists of a prefill phase and a decode phase. However, existing LLM deployment practices often overlook the distinct characteristics of these phases, leading to significant interference. To mitigate interference, our insight is to carefully schedule and group inference requests based on their characteristics. We realize this idea in TetriInfer through three pillars. First, it partitions prompts into fixed-size chunks so that the accelerator always runs close to its computation-saturated limit. Second, it disaggregates prefill and decode instances so each can run independently. Finally, it uses a smart two-level scheduling algorithm augmented with predicted resource usage to avoid decode scheduling hotspots. Results show that TetriInfer improves time-to-first-token…
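The first pillar, partitioning prompts into fixed-size chunks, can be illustrated with a minimal sketch. The abstract does not give the chunk size or the exact packing policy, so `CHUNK_SIZE`, `chunk_prompts`, and the greedy in-order packing below are hypothetical choices made for illustration, not TetriInfer's actual implementation.

```python
from typing import List, Tuple

# Hypothetical chunk size in tokens; the abstract does not state the real value.
CHUNK_SIZE = 8

# A slice is (request_id, start_token, end_token), with end exclusive.
Slice = Tuple[int, int, int]


def chunk_prompts(prompt_lengths: List[int],
                  chunk_size: int = CHUNK_SIZE) -> List[List[Slice]]:
    """Greedily pack prompts, in arrival order, into fixed-size chunks.

    Long prompts are split across chunks and short prompts are merged,
    so every chunk except possibly the last holds exactly chunk_size tokens.
    """
    chunks: List[List[Slice]] = []
    current: List[Slice] = []
    used = 0  # tokens already placed in the current chunk
    for req_id, length in enumerate(prompt_lengths):
        start = 0
        while start < length:
            # Take as many tokens as fit in the remaining chunk space.
            take = min(chunk_size - used, length - start)
            current.append((req_id, start, start + take))
            used += take
            start += take
            if used == chunk_size:
                chunks.append(current)
                current, used = [], 0
    if current:
        # Final partial chunk; a real system would likely pad it.
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    # Three prompts of 5, 10, and 3 tokens packed into chunks of 8 tokens.
    for i, chunk in enumerate(chunk_prompts([5, 10, 3], chunk_size=8)):
        print(i, chunk)
```

Under these assumptions every prefill step processes the same number of tokens regardless of how prompt lengths vary, which is the property the abstract credits with keeping the accelerator near its computation-saturated limit.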

Taxonomy

Topics: Ferroelectric and Negative Capacitance Devices · Cloud Computing and Resource Management · Topic Modeling