Untied Ulysses: Memory-Efficient Context Parallelism via Headwise Chunking
Ravi Ghadia, Maksim Abraham, Sergei Vorobyov, Max Ryabinin

TL;DR
This paper introduces UPipe, a memory-efficient context parallelism method that enables training of Transformer models with much longer sequences by reducing activation memory usage through headwise chunking.
Contribution
UPipe is a novel technique that performs fine-grained chunking at the attention head level to significantly reduce memory usage and support longer context lengths in Transformer training.
Findings
Reduces activation memory by up to 87.5% for 32B Transformers.
Supports a context length of 5 million tokens for Llama3-8B on a single node.
Matches previous methods in training speed while enabling longer sequences.
Abstract
Efficiently processing long sequences with Transformer models usually requires splitting the computations across accelerators via context parallelism. The dominant approaches in this family of methods, such as Ring Attention or DeepSpeed Ulysses, enable scaling over the context dimension but do not focus on memory efficiency, which limits the sequence lengths they can support. More advanced techniques, such as Fully Pipelined Distributed Transformer or activation offloading, can further extend the possible context length at the cost of training throughput. In this paper, we present UPipe, a simple yet effective context parallelism technique that performs fine-grained chunking at the attention head level. This technique significantly reduces the activation memory usage of self-attention, breaking the activation memory barrier and unlocking much longer context lengths. Our approach…
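The idea of chunking at the attention head level can be illustrated with a small single-device sketch. It is not the authors' distributed implementation: there are no multiple accelerators, no communication between them, and no fused attention kernel here, and the function names and the `heads_per_chunk` parameter are hypothetical. It only shows that attention heads are independent, so processing them a few at a time gives the same output while the largest intermediate buffer scales with the chunk size rather than with the total number of heads.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_all_heads(q, k, v):
    # q, k, v have shape (heads, seq, head_dim).
    # The score tensor for every head is materialized at once.
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)  # (heads, seq, seq)
    return softmax(scores) @ v


def attention_headwise_chunked(q, k, v, heads_per_chunk):
    # Hypothetical helper: same computation, but only `heads_per_chunk`
    # heads are processed (and have intermediates alive) at any one time.
    num_heads, _, d = q.shape
    out = np.empty_like(q)
    peak_score_elems = 0  # largest intermediate score buffer seen
    for start in range(0, num_heads, heads_per_chunk):
        stop = min(start + heads_per_chunk, num_heads)
        scores = q[start:stop] @ k[start:stop].transpose(0, 2, 1) / np.sqrt(d)
        peak_score_elems = max(peak_score_elems, scores.size)
        out[start:stop] = softmax(scores) @ v[start:stop]
    return out, peak_score_elems
```

In this toy setting, calling `attention_headwise_chunked(q, k, v, heads_per_chunk=2)` on tensors with 8 heads keeps at most a quarter of the score elements alive compared with `attention_all_heads`, at the cost of a short Python loop over chunks. How the paper combines this with context parallelism across accelerators is not covered by the sketch.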
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis
