HelixPipe: Efficient Distributed Training of Long Sequence Transformers with Attention Parallel Pipeline Parallelism
Geng Zhang, Shenggan Cheng, Xuanlei Zhao, Ziming Liu, Yang You

TL;DR
HelixPipe is a pipeline parallelism scheme for long sequence transformer training. It reduces pipeline bubbles by running the attention computations of different micro batches in parallel across pipeline stages, balances memory usage across stages, and overlaps communication with computation.
Contribution
It proposes attention parallel partition, a two-fold first-in-last-out micro batch schedule, and recomputation without attention combined with chunked MLP to improve long sequence transformer training.
Findings
Achieves a 26% speedup over the baseline when training a 7B model at a 128k sequence length.
Outperforms existing methods in throughput and scalability.
Demonstrates effectiveness across various model sizes and cluster configurations.
Abstract
As transformer sequence lengths grow, existing pipeline parallelisms incur suboptimal performance due to the quadratic attention computation and the substantial memory overhead. To relieve these challenges, we propose HelixPipe, a novel pipeline parallelism for long sequence transformer training. First, HelixPipe introduces attention parallel partition, which schedules attention computations of different micro batches across different pipeline stages in parallel, reducing pipeline bubbles. Second, it employs a two-fold first-in-last-out micro batch schedule to balance memory usage and overlap communication with computation. Additionally, HelixPipe utilizes recomputation without attention and chunked MLP to mitigate fragmentation and enable longer sequences. Experiments demonstrate that HelixPipe gains increasing advantages with longer sequence lengths, and outperforms existing methods…
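To make the "quadratic attention computation" point concrete, the sketch below estimates how much of one transformer layer's forward FLOPs goes to attention as the sequence grows. It is not taken from the paper: it uses the common dense-transformer approximation (24·s·h² for the linear layers with a 4h-wide MLP, 4·s²·h for the attention score and value products), ignores causal masking, and assumes a hidden size of 4096 as a stand-in for a 7B-class model. The function names are hypothetical.

```python
def layer_forward_flops(seq_len, hidden):
    """Approximate forward FLOPs of one transformer layer for one sequence.

    Assumes a standard dense layer with a 4*hidden MLP and no causal-mask savings.
    """
    # QKV and output projections (8*s*h^2) plus the two MLP matmuls (16*s*h^2).
    linear = 24 * seq_len * hidden**2
    # QK^T scores (2*s^2*h) plus attention-weighted values (2*s^2*h).
    attention = 4 * seq_len**2 * hidden
    return linear, attention


def attention_share(seq_len, hidden):
    """Fraction of the layer's forward FLOPs spent in the attention products."""
    linear, attention = layer_forward_flops(seq_len, hidden)
    return attention / (linear + attention)


HIDDEN = 4096  # assumed hidden size, not a value reported in the source

for s in (4096, 32768, 131072):
    print(f"seq_len={s:>7}: attention share = {attention_share(s, HIDDEN):.2f}")
```

Under these assumptions the share simplifies to s / (6h + s), so it rises from roughly one seventh at a 4k sequence length to well over four fifths at 128k. That is why a pipeline schedule that treats attention as just another part of a layer becomes increasingly unbalanced as sequences lengthen.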
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Low-power high-performance VLSI design · Advanced Neural Network Applications
