DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

Sam Ade Jacobs; Masahiro Tanaka; Chengming Zhang; Minjia Zhang; Shuaiwen Leon Song; Samyam Rajbhandari; Yuxiong He

arXiv:2309.14509 · cs.LG · October 5, 2023 · 6 cites

TL;DR

DeepSpeed-Ulysses introduces a novel system optimization for training large language models with extremely long sequences, achieving scalable, efficient training by partitioning data along sequence length and optimizing communication.

Contribution

It presents a new sequence parallelism method that maintains constant communication overhead, enabling scalable training of long sequence Transformer models.
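As a rough illustration of what sequence parallelism of this kind involves, the sketch below simulates it on a single machine with NumPy. It assumes the scheme described in the full paper: each device holds a slice of the sequence, an all-to-all exchange regroups the data so each device sees the full sequence for a subset of attention heads, attention runs locally, and a second all-to-all restores the sequence partition. The function names (`all_to_all_seq_to_head`, `all_to_all_head_to_seq`, `attention`) are hypothetical helpers for this example, not DeepSpeed APIs, and the "devices" are just list entries.

```python
import numpy as np

def all_to_all_seq_to_head(shards):
    """Simulated all-to-all: P shards of [N/P, H, d] -> P shards of [N, H/P, d]."""
    P = len(shards)
    H = shards[0].shape[1]
    assert H % P == 0, "head count must be divisible by the device count"
    hp = H // P
    out = []
    for dst in range(P):
        # Device `dst` receives its head group from every sequence slice.
        pieces = [src[:, dst * hp:(dst + 1) * hp, :] for src in shards]
        out.append(np.concatenate(pieces, axis=0))
    return out

def all_to_all_head_to_seq(shards):
    """Simulated inverse all-to-all: P shards of [N, H/P, d] -> P shards of [N/P, H, d]."""
    P = len(shards)
    N = shards[0].shape[0]
    assert N % P == 0, "sequence length must be divisible by the device count"
    n_local = N // P
    out = []
    for dst in range(P):
        # Device `dst` receives its sequence slice from every head group.
        pieces = [src[dst * n_local:(dst + 1) * n_local, :, :] for src in shards]
        out.append(np.concatenate(pieces, axis=1))
    return out

def attention(q, k, v):
    """Plain softmax attention over [N, H, d] tensors, independent per head."""
    d = q.shape[-1]
    scores = np.einsum("nhd,mhd->hnm", q, k) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return np.einsum("hnm,mhd->nhd", weights, v)

# Toy sizes (assumptions for the demo): 8 tokens, 4 heads, head dim 3, 2 devices.
rng = np.random.default_rng(0)
N, H, d, P = 8, 4, 3, 2
q, k, v = (rng.standard_normal((N, H, d)) for _ in range(3))
reference = attention(q, k, v)  # single-device result

# Partition along the sequence dimension, then regroup by heads.
q_h, k_h, v_h = (all_to_all_seq_to_head(np.split(x, P, axis=0)) for x in (q, k, v))
local = [attention(q_h[r], k_h[r], v_h[r]) for r in range(P)]

# Return to the sequence partition and stitch the shards together for comparison.
result = np.concatenate(all_to_all_head_to_seq(local), axis=0)
max_err = float(np.abs(result - reference).max())
```

Because attention heads are independent of one another, the sharded computation reproduces the single-device output up to floating-point error. A real implementation would replace the simulated exchanges with a collective such as an all-to-all over a process group.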

Findings

01 Trains 2.5x faster than the existing state-of-the-art baseline.

02 Supports 4x longer sequence lengths.

03 Maintains constant communication volume when sequence length and the number of devices increase proportionally.
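The communication claim can be made concrete with a little arithmetic. The sketch below assumes the cost model given in the full paper's analysis (not stated on this page): with sequence length N, hidden size h, and P devices, the all-to-all exchanges move roughly 4·N·h/P elements per link per Transformer layer. The function name `per_link_volume` and the sizes used are illustrative assumptions only.

```python
def per_link_volume(seq_len, hidden, devices):
    """Elements sent per link per layer under the assumed 4*N*h/P cost model."""
    return 4 * seq_len * hidden / devices

hidden = 4096  # illustrative hidden size

# Sequence length and device count doubled together: volume stays the same.
scaled_together = [
    per_link_volume(n, hidden, p)
    for n, p in [(8192, 8), (16384, 16), (32768, 32)]
]

# Device count held at 8 while the sequence grows: volume grows linearly.
fixed_devices = [per_link_volume(n, hidden, 8) for n in (8192, 16384, 32768)]
```

Under this model the volume stays flat only when devices are added in step with sequence length; with a fixed device count it doubles each time the sequence doubles.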

Abstract

Computation in a typical Transformer-based large language model (LLM) can be characterized by batch size, hidden dimension, number of layers, and sequence length. Until now, systems work on accelerating LLM training has focused on the first three dimensions: data parallelism for batch size, tensor parallelism for hidden size, and pipeline parallelism for model depth or layers. These widely studied forms of parallelism are not targeted or optimized for long-sequence Transformer models. Given practical application needs for long-sequence LLMs, renewed attention is being drawn to sequence parallelism. However, existing work in sequence parallelism is constrained by memory-communication inefficiency, limiting its scalability to long-sequence large models. In this work, we introduce DeepSpeed-Ulysses, a novel, portable and effective methodology for enabling highly efficient and…

Code & Models

Datasets

xingzhaohu/agentshot
dataset · 77k dl

Taxonomy

Topics: Topic Modeling · Ferroelectric and Negative Capacitance Devices · Natural Language Processing Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Adam · Layer Normalization · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Dense Connections