DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, Yuxiong He

TL;DR
DeepSpeed-Ulysses is a system optimization for training large language models on extremely long sequences. It partitions input data along the sequence dimension and uses efficient all-to-all collective communication for attention, which keeps training scalable and efficient.
Contribution
It presents a new sequence parallelism method whose communication volume stays constant when sequence length and the number of devices grow proportionally, enabling scalable training of long-sequence Transformer models.
Findings
Trains 2.5x faster than the existing state-of-the-art baseline.
Supports 4x longer sequence lengths than that baseline.
Maintains constant communication volume when sequence length and the number of compute devices are increased proportionally.
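The constant-volume finding holds when devices scale together with sequence length, and a little arithmetic makes that concrete. The sketch below assumes the per-link cost model from the paper's communication analysis, which is not reproduced on this page: an all-to-all over P devices with aggregate message size M sends M/P per link, and each Transformer layer exchanges 4Nh elements in total (3Nh for the query, key and value projections plus Nh for the attention output). The function name and the example sizes are hypothetical.

```python
def ulysses_volume_per_link(seq_len: int, hidden: int, num_devices: int) -> float:
    """Elements sent per link per Transformer layer, ASSUMING the 4*N*h/P model."""
    return 4 * seq_len * hidden / num_devices


# Illustrative sizes only; they are not taken from the paper's experiments.
base = ulysses_volume_per_link(seq_len=65_536, hidden=4_096, num_devices=8)
scaled = ulysses_volume_per_link(seq_len=262_144, hidden=4_096, num_devices=32)
longer_only = ulysses_volume_per_link(seq_len=262_144, hidden=4_096, num_devices=8)

print(base == scaled)      # sequence length and devices both grow 4x
print(longer_only / base)  # sequence length grows 4x on the same devices
```

Under this model the per-link volume is unchanged in the first comparison and grows linearly in the second, so the claim is about proportional scaling rather than longer sequences on a fixed number of devices.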
Abstract
Computation in a typical Transformer-based large language model (LLM) can be characterized by batch size, hidden dimension, number of layers, and sequence length. Until now, systems work on accelerating LLM training has focused on the first three dimensions: data parallelism for batch size, tensor parallelism for hidden size, and pipeline parallelism for model depth or layers. These widely studied forms of parallelism are not targeted at or optimized for long-sequence Transformer models. Given the practical application needs for long-sequence LLMs, renewed attention is being drawn to sequence parallelism. However, existing work in sequence parallelism is constrained by memory-communication inefficiency, limiting its scalability to long-sequence large models. In this work, we introduce DeepSpeed-Ulysses, a novel, portable and effective methodology for enabling highly efficient and…
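The abstract is cut off before it describes the mechanism. In the full paper, each device starts with a slice of the sequence for all attention heads, and an all-to-all exchange regroups the data so that each device holds the full sequence for a subset of heads; a second all-to-all restores the sequence-sharded layout after attention. The following is a minimal single-process sketch of that regrouping in pure Python. It is not the DeepSpeed API: the function names are hypothetical, and each element is just a `(token, head)` label standing in for real activations.

```python
def shard_by_sequence(seq_len, num_heads, num_devices):
    """Device r holds a contiguous slice of tokens, each with all heads."""
    per_dev = seq_len // num_devices
    return [
        [[(t, h) for h in range(num_heads)]
         for t in range(r * per_dev, (r + 1) * per_dev)]
        for r in range(num_devices)
    ]


def all_to_all_seq_to_head(shards, num_heads):
    """Simulated all-to-all: device r sends head group d of its tokens to device d."""
    num_devices = len(shards)
    heads_per_dev = num_heads // num_devices
    out = []
    for d in range(num_devices):
        lo, hi = d * heads_per_dev, (d + 1) * heads_per_dev
        # Concatenate, in rank order, the chunk every device sends to device d.
        out.append([row[lo:hi] for r in range(num_devices) for row in shards[r]])
    return out


def all_to_all_head_to_seq(by_head):
    """Inverse exchange: device r gets its token slice back with all heads."""
    num_devices = len(by_head)
    per_dev = len(by_head[0]) // num_devices
    out = []
    for r in range(num_devices):
        rows = []
        for t in range(r * per_dev, (r + 1) * per_dev):
            row = []
            for d in range(num_devices):
                row.extend(by_head[d][t])  # head groups in device order
            rows.append(row)
        out.append(rows)
    return out


shards = shard_by_sequence(seq_len=8, num_heads=4, num_devices=2)
by_head = all_to_all_seq_to_head(shards, num_heads=4)
restored = all_to_all_head_to_seq(by_head)
print(len(shards[0]), len(shards[0][0]))    # tokens and heads per device before
print(len(by_head[0]), len(by_head[0][0]))  # tokens and heads per device after
print(restored == shards)
```

The sketch assumes that both the sequence length and the number of heads divide evenly by the number of devices. With the full sequence present for its heads, each device can run attention locally on its head group, which is why this layout suits attention.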
Code & Models
- 🤗 Wan-AI/Wan2.2-T2V-A14B-Diffusers · model · 105k downloads · ♡ 126
- 🤗 Wan-AI/Wan2.2-T2V-A14B · model · 18k downloads · ♡ 438
- 🤗 lightx2v/Wan2.2-Lightning · model · 25 downloads · ♡ 606
- 🤗 magespace/Wan2.2-I2V-A14B-Lightning-Diffusers · model · 36k downloads · ♡ 1
- 🤗 Wan-AI/Wan2.1-FLF2V-14B-720P · model · 1.4k downloads · ♡ 228
- 🤗 wan-community/Wan2.1-FLF2V-14B-720P · model · 47 downloads
- 🤗 Wan-AI/Wan2.1-VACE-1.3B · model · 859 downloads · ♡ 126
- 🤗 Wan-AI/Wan2.1-VACE-14B · model · 5.4k downloads · ♡ 488
- 🤗 Wan-AI/Wan2.1-VACE-14B-diffusers · model · 3.2k downloads · ♡ 34
- 🤗 Wan-AI/Wan2.1-VACE-1.3B-diffusers · model · 3.8k downloads · ♡ 29
Taxonomy
Topics: Topic Modeling · Ferroelectric and Negative Capacitance Devices · Natural Language Processing Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Adam · Layer Normalization · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Dense Connections
