NanoFlow: Towards Optimal Large Language Model Serving Throughput

Kan Zhu; Yufei Gao; Yilong Zhao; Liangyu Zhao; Gefei Zuo; Yile Gu; Dedong Xie; Tian Tang; Qinyu Xu; Zihao Ye; Keisuke Kamahori; Chien-Yu Lin; Ziren Wang; Stephanie Wang; Arvind Krishnamurthy; Baris Kasikci

arXiv:2408.12757·cs.DC·May 27, 2025

TL;DR

NanoFlow is a serving framework that raises large language model throughput by exploiting intra-device parallelism: it splits inputs into nano-batches and duplicates operations so that compute, memory, and network usage can overlap within a single device.

Contribution

NanoFlow introduces a method for overlapping the use of heterogeneous resources within a device, improving LLM serving throughput through automatic nano-batch management and resource allocation.
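To make the overlap idea concrete, here is a minimal sketch in Python. It is not NanoFlow's scheduler: the `Op` type, the function names, and the costs are hypothetical, and the model assumes that an operation's cost scales linearly with nano-batch size, that operations on different resources never interfere, and that operations on the same resource run one at a time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """One operation in a layer (hypothetical toy model)."""
    name: str
    resource: str  # "compute", "memory", or "network"
    cost: float    # time units for the full batch (made-up values)


def split_batch(batch_size: int, n_nano: int) -> list[int]:
    """Split a batch into n_nano nearly equal nano-batch sizes."""
    base, extra = divmod(batch_size, n_nano)
    return [base + (1 if i < extra else 0) for i in range(n_nano)]


def sequential_time(ops: list[Op]) -> float:
    """Time when every operation runs back to back on the full batch."""
    return sum(op.cost for op in ops)


def overlapped_time(ops: list[Op], batch_size: int, n_nano: int) -> float:
    """Makespan when each op is duplicated once per nano-batch.

    Within a nano-batch, ops keep their order. Ops on different
    resources may run at the same time; ops on one resource serialize.
    """
    resource_free: dict[str, float] = {}
    makespan = 0.0
    for size in split_batch(batch_size, n_nano):
        ready = 0.0  # time at which this nano-batch's previous op finished
        for op in ops:
            start = max(ready, resource_free.get(op.resource, 0.0))
            end = start + op.cost * size / batch_size  # assumed linear scaling
            resource_free[op.resource] = end
            ready = end
        makespan = max(makespan, ready)
    return makespan


# Hypothetical layer: one compute-heavy, one memory-heavy, one network op.
ops = [
    Op("gemm", "compute", 6.0),
    Op("attention", "memory", 3.0),
    Op("all_reduce", "network", 3.0),
]
print(sequential_time(ops))        # 12.0
print(overlapped_time(ops, 8, 2))  # 9.0
print(overlapped_time(ops, 8, 4))  # 7.5
```

In this toy model the makespan approaches the cost of the busiest resource (here, compute at 6.0) as the number of nano-batches grows. A real system would also pay per-nano-batch overheads that this sketch ignores.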

Findings

01. Achieves 1.91x throughput boost over state-of-the-art systems.

02. Reaches 50% to 72% of optimal throughput on popular models.

03. Effective for models like LLaMA-2-70B and Mixtral 8x7B.

Abstract

Large Language Models (LLMs) have resulted in a surging demand for planet-scale serving systems, where tens of thousands of GPUs continuously serve hundreds of millions of users. Consequently, throughput has emerged as a key metric that determines serving systems' performance. Due to large model sizes and memory-intensive self-attention, LLM serving has been commonly assumed to be memory-bound. Through a detailed analysis, we show that despite having memory-intensive components, end-to-end LLM serving is compute-bound for most common workloads and LLMs. Alas, most existing serving engines fall short of optimal compute utilization, because the heterogeneous operations that comprise LLM serving (compute, memory, networking) are executed sequentially within a device. We propose NanoFlow, a novel serving framework that exploits intra-device parallelism, which overlaps the usage of…
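The memory-bound versus compute-bound distinction in the abstract can be illustrated with a back-of-the-envelope estimate. The sketch below is not the paper's analysis: the function names are hypothetical, the hardware figures are round illustrative values rather than measurements, the compute term uses the common approximation of 2 FLOPs per parameter per token for a dense forward pass, and the memory term counts only one read of the weights per iteration, ignoring KV-cache traffic.

```python
def iteration_times(n_params: float, batch_tokens: int, bytes_per_param: float,
                    peak_flops: float, mem_bandwidth: float) -> tuple[float, float]:
    """Return (compute_seconds, memory_seconds) for one forward iteration."""
    # Approximation: ~2 FLOPs per parameter per token (dense model).
    compute_s = 2.0 * n_params * batch_tokens / peak_flops
    # Simplification: weights are read once per iteration, whatever the batch size.
    memory_s = n_params * bytes_per_param / mem_bandwidth
    return compute_s, memory_s


def bound(n_params: float, batch_tokens: int, bytes_per_param: float,
          peak_flops: float, mem_bandwidth: float) -> str:
    """Name the resource that dominates iteration time in this toy model."""
    compute_s, memory_s = iteration_times(
        n_params, batch_tokens, bytes_per_param, peak_flops, mem_bandwidth)
    return "compute" if compute_s >= memory_s else "memory"


# Illustrative, assumed values: 70e9 parameters, 2 bytes per parameter,
# 300 TFLOP/s peak compute, 2 TB/s memory bandwidth.
HW = dict(n_params=70e9, bytes_per_param=2.0, peak_flops=300e12, mem_bandwidth=2e12)

print(bound(batch_tokens=1, **HW))     # memory
print(bound(batch_tokens=1024, **HW))  # compute
```

Under these assumptions the crossover sits at `bytes_per_param * peak_flops / (2 * mem_bandwidth)`, which is 150 tokens per iteration for the values above, so large batches push the estimate toward the compute-bound regime.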

Taxonomy

Topics: Topic Modeling