Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads
Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan

TL;DR
TetriInfer is a novel approach that disaggregates and schedules LLM inference phases to significantly reduce interference, improve efficiency, and lower latency in cloud services.
Contribution
It introduces a new inference serving framework that separates the prefill and decode phases into independent instances and uses two-level scheduling, augmented with predicted resource usage, to improve resource utilization.
Findings
Reduces resource usage by 38%
Lowers average TTFT by 97%
Decreases average JCT by 47%
Abstract
Transformer-based large language model (LLM) inference serving is now the backbone of many cloud services. LLM inference consists of a prefill phase and a decode phase. However, existing LLM deployment practices often overlook the distinct characteristics of these phases, leading to significant interference. To mitigate interference, our insight is to carefully schedule and group inference requests based on their characteristics. We realize this idea in TetriInfer through three pillars. First, it partitions prompts into fixed-size chunks so that the accelerator always runs close to its computation-saturated limit. Second, it disaggregates prefill and decode instances so each can run independently. Finally, it uses a smart two-level scheduling algorithm augmented with predicted resource usage to avoid decode scheduling hotspots. Results show that TetriInfer improves time-to-first-token…
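To make the first and third pillars concrete, here is a minimal Python sketch. It is not TetriInfer's implementation: the function names, the greedy packing order, and the "least predicted load" rule are all assumptions chosen for illustration. `chunk_prompts` packs prompt tokens from several requests into fixed-size chunks, and `pick_decode_instance` routes a request to the decode instance with the smallest predicted outstanding load.

```python
from typing import Dict, List, Tuple

# One slice of a prompt: (request id, number of prompt tokens taken).
Slice = Tuple[str, int]


def chunk_prompts(prompt_lens: Dict[str, int], chunk_size: int) -> List[List[Slice]]:
    """Greedily pack prompt tokens into chunks of exactly chunk_size tokens.

    Long prompts are split across chunks and short prompts share a chunk,
    so every chunk except possibly the last one is full. This is a
    hypothetical packing policy, not the one used by the paper.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks: List[List[Slice]] = []
    current: List[Slice] = []
    room = chunk_size
    for req_id, remaining in prompt_lens.items():
        while remaining > 0:
            take = min(remaining, room)
            current.append((req_id, take))
            remaining -= take
            room -= take
            if room == 0:  # chunk is full, start a new one
                chunks.append(current)
                current, room = [], chunk_size
    if current:  # trailing, partially filled chunk
        chunks.append(current)
    return chunks


def pick_decode_instance(predicted_load: Dict[str, int], predicted_tokens: int) -> str:
    """Pick the decode instance with the lowest predicted load (assumed rule).

    predicted_load maps instance name to predicted outstanding decode tokens;
    it is updated in place so later requests see the new load.
    """
    target = min(predicted_load, key=predicted_load.get)
    predicted_load[target] += predicted_tokens
    return target


if __name__ == "__main__":
    print(chunk_prompts({"a": 5, "b": 3, "c": 4}, chunk_size=4))
    load = {"decode-0": 10, "decode-1": 4}
    print(pick_decode_instance(load, predicted_tokens=8), load)
```

Fixed-size chunks keep each prefill step at a constant token count regardless of prompt length, which is what lets the accelerator stay near its saturation point; the load-aware pick shows how a length prediction can steer requests away from an already busy decode instance.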
Taxonomy
Topics: Ferroelectric and Negative Capacitance Devices · Cloud Computing and Resource Management · Topic Modeling
