Synergistic Tensor and Pipeline Parallelism
Mengshi Qi, Jiaxuan Peng, Jie Zhang, Juan Zhu, Yong Li, Huadong Ma

TL;DR
This paper introduces a scheduling method that combines tensor and pipeline parallelism to reduce communication and synchronization overheads, improving training throughput by up to 12% for LLMs and up to 16% for MLLMs.
Contribution
It proposes a synergistic schedule that decouples and braids computation units, effectively eliminating TP bubbles and reducing PP bubbles for more efficient distributed training.
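To make the idea of braiding concrete, the following toy model compares a back-to-back ordering of forward and backward units with an interleaved one. It is a minimal sketch under stated assumptions, not the paper's schedule: the `Unit` type, the `makespan` and `braid` helpers, the two-stream timing model, and all durations are hypothetical and chosen only for illustration. Each unit runs on a compute stream and is followed by a TP collective on a separate communication stream; the next unit of the same chain must wait for that collective, while a unit from the other chain can run in the meantime.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One fine-grained computation unit (hypothetical, for illustration)."""
    chain: str      # which dependency chain it belongs to, e.g. "F" or "B"
    name: str
    compute: float  # time spent on the compute stream
    comm: float     # time of the TP collective that follows the compute


def makespan(order):
    """Finish time of `order` on one compute stream and one comm stream.

    A unit starts once the compute stream is free and the previous unit of
    the same chain has finished its collective.
    """
    compute_free = 0.0
    comm_free = 0.0
    chain_ready = {}
    for u in order:
        start = max(compute_free, chain_ready.get(u.chain, 0.0))
        compute_free = start + u.compute
        comm_start = max(compute_free, comm_free)
        comm_free = comm_start + u.comm
        chain_ready[u.chain] = comm_free
    return max(compute_free, comm_free)


def braid(a, b):
    """Interleave two unit sequences round-robin, keeping each one's order."""
    out = []
    for i in range(max(len(a), len(b))):
        if i < len(a):
            out.append(a[i])
        if i < len(b):
            out.append(b[i])
    return out


# Made-up durations: forward units of one microbatch, backward units of another.
forward = [Unit("F", "attn", 2.0, 1.0), Unit("F", "mlp", 2.0, 1.0)]
backward = [Unit("B", "mlp", 2.0, 1.0), Unit("B", "attn", 2.0, 1.0)]

serial_time = makespan(forward + backward)
braided_time = makespan(braid(forward, backward))
print(f"serial:  {serial_time}")
print(f"braided: {braided_time}")
```

In this toy setting the braided order hides every collective except the last one behind compute from the other chain, so its finish time approaches the total compute time. Real schedules must also account for memory limits and cross-stage dependencies, which this sketch ignores.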
Findings
Up to 12% throughput improvement for LLMs
Up to 16% throughput improvement for MLLMs
Effective reduction of communication and synchronization overheads
Abstract
In machine learning systems, hybrid model parallelism combining tensor parallelism (TP) and pipeline parallelism (PP) has become the dominant solution for distributed training of Large Language Models (LLMs) and Multimodal LLMs (MLLMs). However, TP introduces significant collective communication overheads, while PP suffers from synchronization inefficiencies such as pipeline bubbles. Existing works primarily address these challenges from isolated perspectives, focusing either on overlapping TP communication or on flexible PP scheduling to mitigate pipeline bubbles. In this paper, we propose a new synergistic tensor and pipeline parallelism schedule that simultaneously reduces both types of bubbles. Our proposed schedule decouples the forward and backward passes in PP into fine-grained computation units, which are then braided to form a composite computation sequence. This…
Taxonomy
Topics: Tensor decomposition and applications · Topic Modeling · Advanced Neural Network Applications
