LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism
Diandian Gu, Peng Sun, Qinghao Hu, Ting Huang, Xun Chen, Yingtong Xiong, Guoteng Wang, Qiaoling Chen, Shangchun Zhao, Jiarui Fang, Yonggang Wen, Tianwei Zhang, Xin Jin, Xuanzhe Liu

TL;DR
LoongTrain is a training system for long-sequence LLMs built around 2D-Attention, a mechanism that combines head parallelism and context parallelism. It trains at scale more efficiently than the state-of-the-art sequence-parallel baselines it is compared against.
Contribution
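The paper presents LoongTrain, a system whose 2D-Attention mechanism removes the scalability constraints of prior sequence parallelism methods while keeping training efficient. It also introduces Double-Ring-Attention, analyzes device placement strategies, and is implemented with hybrid ZeRO and Selective Checkpoint++.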
Findings
Outperforms DeepSpeed-Ulysses and Megatron Context Parallelism in training speed and scalability.
Achieves up to 2.88x higher Model FLOPs Utilization (MFU).
Demonstrates effective device placement strategies for faster training.
Abstract
Efficiently training LLMs with long sequences is important yet challenged by the massive computation and memory requirements. Sequence parallelism has been proposed to tackle these problems, but existing methods suffer from scalability or efficiency issues. We propose LoongTrain, a novel system to efficiently train LLMs with long sequences at scale. The core of LoongTrain is the 2D-Attention mechanism, which combines both head-parallel and context-parallel techniques to break the scalability constraints while maintaining efficiency. We introduce Double-Ring-Attention and analyze the performance of device placement strategies to further speed up training. We implement LoongTrain with the hybrid ZeRO and Selective Checkpoint++ techniques. Experiment results show that LoongTrain outperforms state-of-the-art baselines, i.e., DeepSpeed-Ulysses and Megatron Context Parallelism, in both…
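The abstract describes 2D-Attention only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes the devices form an hp_size x cp_size grid, that attention heads are split evenly along the head-parallel dimension, and that the sequence is split into contiguous, equal chunks along the context-parallel dimension. The names Shard, plan_2d_attention, and head_first are hypothetical, and the head_first flag only illustrates that rank-to-grid placement is a free choice; it is not the paper's definition of its placement strategies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    """Work assigned to one device (hypothetical layout, not the paper's API)."""
    rank: int
    hp_index: int   # coordinate along the head-parallel dimension
    cp_index: int   # coordinate along the context-parallel dimension
    heads: range    # attention heads this rank computes
    tokens: range   # sequence positions this rank holds


def plan_2d_attention(num_heads, seq_len, hp_size, cp_size, head_first=True):
    """Map hp_size * cp_size ranks onto a 2D (head, context) grid.

    Assumption: even, contiguous splits. Real systems may use a different
    token layout, for example to balance causal-attention work.
    """
    if num_heads % hp_size != 0:
        raise ValueError("num_heads must be divisible by hp_size")
    if seq_len % cp_size != 0:
        raise ValueError("seq_len must be divisible by cp_size")

    heads_per_rank = num_heads // hp_size
    tokens_per_rank = seq_len // cp_size

    shards = []
    for rank in range(hp_size * cp_size):
        if head_first:
            # Consecutive ranks vary the head-parallel coordinate first.
            hp_index, cp_index = rank % hp_size, rank // hp_size
        else:
            # Consecutive ranks vary the context-parallel coordinate first.
            hp_index, cp_index = rank // cp_size, rank % cp_size
        heads = range(hp_index * heads_per_rank, (hp_index + 1) * heads_per_rank)
        tokens = range(cp_index * tokens_per_rank, (cp_index + 1) * tokens_per_rank)
        shards.append(Shard(rank, hp_index, cp_index, heads, tokens))
    return shards


if __name__ == "__main__":
    for shard in plan_2d_attention(num_heads=8, seq_len=64, hp_size=2, cp_size=4):
        print(shard)
```

Under these assumptions the total parallel degree is hp_size * cp_size, so it is no longer capped by the number of attention heads the way a purely head-parallel split would be: a model with 8 heads could still be spread over 32 devices with hp_size=8 and cp_size=4.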
Taxonomy
Topics: Natural Language Processing Techniques · Handwritten Text Recognition Techniques · Speech Recognition and Synthesis
