TD-Pipe: Temporally-Disaggregated Pipeline Parallelism Architecture for High-Throughput LLM Inference
Hongbin Zhang, Taosheng Wei, Zhenyi Zheng, Jiangsu Du, Zhiguang Chen, Yutong Lu

TL;DR
TD-Pipe introduces a temporally-disaggregated pipeline parallelism architecture for LLM inference that reduces pipeline bubbles and balances workloads, improving throughput by up to 1.91x over tensor parallelism and up to 2.73x over traditional pipeline parallelism.
Contribution
The paper proposes TD-Pipe, a pipeline parallelism architecture that separates the prefill and decode phases in time, supported by techniques for control, prediction, and workload balancing.
Findings
Increases LLM inference throughput by up to 1.91x over tensor parallelism.
Achieves up to 2.73x throughput improvement over traditional pipeline parallelism.
Effectively reduces pipeline bubbles and balances workloads across batches.
Abstract
As model sizes continue to grow, pipeline parallelism shows great promise for throughput-oriented LLM inference because of its low communication demand. However, imbalanced pipeline workloads and complex data dependencies in the prefill and decode phases produce massive pipeline bubbles and, in turn, a severe reduction in performance. To better exploit pipeline parallelism for high-throughput LLM inference, we propose TD-Pipe, whose key idea is a temporally-disaggregated pipeline parallelism architecture. Specifically, this architecture disaggregates the prefill and decode phases in the temporal dimension, so as to eliminate the pipeline bubbles caused by phase switching. TD-Pipe identifies potential issues in exploiting the novel architecture and provides solutions. First, a hierarchy-controller structure is used to better coordinate devices in pipeline parallelism by…
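The effect of phase switching can be illustrated with a toy model. The sketch below is not the TD-Pipe scheduler; it is a minimal simulation under stated assumptions: every job costs the same on each pipeline stage, stages process jobs in FIFO order, and a batch's next step cannot enter the first stage until its previous step has left the last stage (the autoregressive dependency). All names (`simulate`, `temporally_disaggregated`, `frequent_switching`) and all cost values are hypothetical, and the frequent-switching baseline is an assumed comparison point, not a system from the paper.

```python
"""Toy pipeline model: few vs. many prefill/decode phase switches."""


def simulate(jobs, num_stages):
    """Run jobs through a linear pipeline in the given FIFO order.

    jobs: list of (batch_id, per_stage_cost). A batch's next job may not
    enter stage 0 until its previous job has left the last stage.
    Returns (makespan, idle_fraction).
    """
    if not jobs:
        return 0.0, 0.0
    stage_free = [0.0] * num_stages  # time at which each stage becomes free
    batch_ready = {}                 # batch_id -> finish time of its last job
    for batch_id, cost in jobs:
        t = batch_ready.get(batch_id, 0.0)
        for s in range(num_stages):
            start = max(t, stage_free[s])  # wait for the input and the stage
            t = start + cost
            stage_free[s] = t
        batch_ready[batch_id] = t
    makespan = stage_free[-1]
    busy_per_stage = sum(cost for _, cost in jobs)
    return makespan, 1.0 - busy_per_stage / makespan


def prefill_then_decode(batch_ids, prefill_cost, decode_cost, decode_steps):
    """One prefill per batch, then round-robin decode steps over the batches."""
    jobs = [(b, prefill_cost) for b in batch_ids]
    for _ in range(decode_steps):
        jobs += [(b, decode_cost) for b in batch_ids]
    return jobs


def temporally_disaggregated(num_batches, prefill_cost, decode_cost, decode_steps):
    """Assumed schedule: a single prefill phase followed by a single decode phase."""
    return prefill_then_decode(range(num_batches), prefill_cost,
                               decode_cost, decode_steps)


def frequent_switching(num_batches, wave, prefill_cost, decode_cost, decode_steps):
    """Assumed baseline: switch phases after every `wave` batches."""
    jobs = []
    for first in range(0, num_batches, wave):
        ids = range(first, min(first + wave, num_batches))
        jobs += prefill_then_decode(ids, prefill_cost, decode_cost, decode_steps)
    return jobs


if __name__ == "__main__":
    costs = dict(prefill_cost=4.0, decode_cost=1.0, decode_steps=2)
    td = simulate(temporally_disaggregated(4, **costs), num_stages=2)
    fs = simulate(frequent_switching(4, wave=2, **costs), num_stages=2)
    print("disaggregated: makespan=%.1f idle=%.3f" % td)
    print("frequent switching: makespan=%.1f idle=%.3f" % fs)
```

With these illustrative parameters (2 stages, 4 batches, prefill cost 4, decode cost 1, 2 decode steps), the single-switch schedule finishes at time 28 and the schedule that switches after every 2 batches finishes at time 31, even though both perform identical work. The numbers describe only this toy model and say nothing about the paper's measured results.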
Taxonomy
Topics: Advanced Neural Network Applications · Network Packet Processing and Optimization · Parallel Computing and Optimization Techniques
