BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences
Ao Sun, Weilin Zhao, Xu Han, Cheng Yang, Zhiyuan Liu, Chuan Shi, Maosong Sun

TL;DR
BurstAttention is a distributed attention framework that significantly improves the efficiency of processing extremely long sequences in Transformer models by reducing communication overheads and speeding up training.
Contribution
It introduces a novel distributed attention framework that optimizes memory and communication efficiency for long sequence processing in large language models.
Findings
Reduces communication overheads by 40%.
Achieves 1.37x speedup on 128K sequence length training.
Outperforms existing distributed attention solutions.
Abstract
Effective attention modules have played a crucial role in the success of Transformer-based large language models (LLMs), but the quadratic time and memory complexities of these attention modules also pose a challenge when processing long sequences. One potential solution for the long sequence problem is to utilize distributed clusters to parallelize the computation of attention modules across multiple devices (e.g., GPUs). However, adopting a distributed approach inevitably introduces extra memory overheads to store local attention results and incurs additional communication costs to aggregate local results into global ones. In this paper, we propose a distributed attention framework named "BurstAttention" to optimize memory access and communication operations at both the global cluster and local device levels. In our experiments, we compare BurstAttention with other competitive…
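The abstract mentions aggregating local attention results into global ones. The sketch below illustrates one generic way such an aggregation can work: each shard of keys and values produces a partial result, and partial results are merged by rescaling with a running softmax maximum. It is a minimal single-query illustration under stated assumptions, not the BurstAttention implementation. The function names (`local_attention`, `merge`, `sharded_attention`, `full_attention`) are hypothetical, and the shards are processed sequentially in one process rather than on separate devices.

```python
import math

def local_attention(q, keys, values):
    """Attention of one query vector over one shard of keys/values.

    Returns (m, s, o): the shard's maximum score, the sum of
    exp(score - m), and the unnormalized weighted sum of values.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(x - m) for x in scores]
    s = sum(weights)
    dim = len(values[0])
    o = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return m, s, o

def merge(a, b):
    """Combine two partial results by rescaling both to a shared maximum."""
    m_a, s_a, o_a = a
    m_b, s_b, o_b = b
    m = max(m_a, m_b)
    c_a, c_b = math.exp(m_a - m), math.exp(m_b - m)
    s = c_a * s_a + c_b * s_b
    o = [c_a * x + c_b * y for x, y in zip(o_a, o_b)]
    return m, s, o

def sharded_attention(q, keys, values, num_shards):
    """Split keys/values into contiguous shards (stand-ins for devices),
    compute each shard's partial result, and merge them one by one."""
    n = len(keys)
    step = math.ceil(n / num_shards)
    state = None
    for start in range(0, n, step):
        part = local_attention(q, keys[start:start + step],
                               values[start:start + step])
        state = part if state is None else merge(state, part)
    _, s, o = state
    return [x / s for x in o]  # normalize once, after all shards are merged

def full_attention(q, keys, values):
    """Reference: softmax attention over the whole sequence at once."""
    _, s, o = local_attention(q, keys, values)
    return [x / s for x in o]
```

Because each shard only passes along a maximum, a sum, and one output vector, the full row of attention scores never has to be stored in one place. For any shard count, `sharded_attention` should agree with `full_attention` up to floating-point rounding.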
Taxonomy
Topics: Algorithms and Data Compression · Advanced Data Compression Techniques · Tensor decomposition and applications
