HelixPipe: Efficient Distributed Training of Long Sequence Transformers with Attention Parallel Pipeline Parallelism

Geng Zhang; Shenggan Cheng; Xuanlei Zhao; Ziming Liu; Yang You

arXiv:2507.00394 · cs.LG · July 2, 2025

Open Access

TL;DR

HelixPipe is a pipeline parallelism scheme for long sequence transformer training. It reduces pipeline bubbles by running the attention computations of different micro batches on different pipeline stages in parallel, balances memory usage across stages, and overlaps communication with computation.

Contribution

It proposes attention parallel partition and a two-fold first-in-last-out micro batch schedule, combined with recomputation without attention and chunked MLP, to improve long sequence transformer training performance.

Findings

01. Achieves a 26% speedup over the baseline when training a 7B model with 128k sequence length.
02. Outperforms existing methods in throughput and scalability.
03. Demonstrates effectiveness across various model sizes and cluster configurations.
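
To put the 7B / 128k setting above in perspective, the following back-of-the-envelope sketch estimates how much of one layer's forward FLOPs goes to attention as the sequence grows. It is not taken from the paper: it assumes a standard dense transformer layer (QKV and output projections, a 4h-wide MLP, full attention with no causal-mask savings), and the hidden size of 4096 is an assumed value for a roughly 7B-parameter model. The function name `layer_forward_flops` is hypothetical.

```python
def layer_forward_flops(seq_len: int, hidden: int) -> dict:
    """Approximate forward FLOPs of one dense transformer layer, one sequence.

    Assumptions (not from the paper):
      - QKV + output projections: 8 * s * h^2
      - MLP with 4h intermediate width: 16 * s * h^2
      - attention score and value matmuls, no causal savings: 4 * s^2 * h
    A multiply-add is counted as 2 FLOPs.
    """
    linear = 24 * seq_len * hidden ** 2      # grows linearly with seq_len
    attention = 4 * seq_len ** 2 * hidden    # grows quadratically with seq_len
    return {
        "linear": linear,
        "attention": attention,
        "attention_share": attention / (linear + attention),
    }


HIDDEN = 4096  # assumed hidden size for a ~7B model; not stated in the source

for s in (4096, 32768, 131072):
    f = layer_forward_flops(s, HIDDEN)
    print(f"seq_len={s:>6}  attention share of layer FLOPs = {f['attention_share']:.2f}")
```

Under these assumptions the attention share simplifies to s / (6h + s), so it rises from about 14% at 4k tokens to about 84% at 128k tokens. This is one way to see why a schedule built around attention matters more as sequences get longer.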

Abstract

As transformer sequence lengths grow, existing pipeline parallelisms incur suboptimal performance due to the quadratic attention computation and the substantial memory overhead. To relieve these challenges, we propose HelixPipe, a novel pipeline parallelism for long sequence transformer training. First, HelixPipe introduces attention parallel partition, which schedules attention computations of different micro batches across different pipeline stages in parallel, reducing pipeline bubbles. Second, it employs a two-fold first-in-last-out micro batch schedule to balance memory usage and overlap communication with computation. Additionally, HelixPipe utilizes recomputation without attention and chunked MLP to mitigate fragmentation and enable longer sequences. Experiments demonstrate that HelixPipe gains increasing advantages with longer sequence lengths, and outperforms existing methods…
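
The chunked MLP mentioned in the abstract can be illustrated with a minimal pure-Python sketch. It assumes that chunking happens along the sequence dimension, which the abstract does not spell out. It shows only the general principle, not the authors' implementation. The helper names (`matmul`, `mlp`, `chunked_mlp`) and the toy sizes are hypothetical. Because an MLP acts on each token independently, processing a few tokens at a time gives the same output while the wide intermediate activation only ever covers one chunk.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain (rows x inner) @ (inner x cols) matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def mlp(x: Matrix, w_up: Matrix, w_down: Matrix) -> Matrix:
    """Position-wise MLP: ReLU(x @ w_up) @ w_down, applied to every token row."""
    hidden = [[max(v, 0.0) for v in row] for row in matmul(x, w_up)]
    return matmul(hidden, w_down)


def chunked_mlp(x: Matrix, w_up: Matrix, w_down: Matrix, chunk: int) -> Matrix:
    """Same result as mlp(), but only `chunk` token rows are expanded at a time.

    The wide intermediate activation is (chunk x ffn) instead of (seq x ffn),
    since each token's output depends only on that token's input.
    """
    out: Matrix = []
    for start in range(0, len(x), chunk):
        out.extend(mlp(x[start:start + chunk], w_up, w_down))
    return out


# Toy sizes (all hypothetical): 8 tokens, hidden size 2, MLP width 4.
x = [[float(t), float(t % 3) - 1.0] for t in range(8)]
w_up = [[0.5, -1.0, 0.25, 1.0], [1.0, 0.5, -0.5, 0.0]]
w_down = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 1.0]]

# Chunking changes peak intermediate size, not the result.
assert chunked_mlp(x, w_up, w_down, chunk=2) == mlp(x, w_up, w_down)
```

In a real training framework the same idea would be applied to GPU tensors, and the chunk size would trade peak activation memory against kernel efficiency. Those details are outside what the abstract states.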

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Low-power high-performance VLSI design · Advanced Neural Network Applications