BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences

Ao Sun; Weilin Zhao; Xu Han; Cheng Yang; Zhiyuan Liu; Chuan Shi; Maosong Sun

arXiv:2403.09347·cs.DC·June 7, 2024·1 cites

Open Access · 1 Repo

TL;DR

BurstAttention is a distributed attention framework that significantly improves the efficiency of processing extremely long sequences in Transformer models by reducing communication overheads and speeding up training.

Contribution

It introduces a novel distributed attention framework that optimizes memory and communication efficiency for long sequence processing in large language models.

Findings

01. Reduces communication overheads by 40%.

02. Achieves a 1.37x speedup when training with a 128K sequence length.

03. Outperforms existing distributed attention solutions.

Abstract

Effective attention modules have played a crucial role in the success of Transformer-based large language models (LLMs), but the quadratic time and memory complexities of these attention modules also pose a challenge when processing long sequences. One potential solution for the long sequence problem is to utilize distributed clusters to parallelize the computation of attention modules across multiple devices (e.g., GPUs). However, adopting a distributed approach inevitably introduces extra memory overheads to store local attention results and incurs additional communication costs to aggregate local results into global ones. In this paper, we propose a distributed attention framework named "BurstAttention" to optimize memory access and communication operations at both the global cluster and local device levels. In our experiments, we compare BurstAttention with other competitive…
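
To make "aggregate local results into global ones" concrete, the sketch below splits the keys and values of a single query into shards, computes attention on each shard independently, and merges the partial outputs using their log-sum-exp statistics. It is a generic illustration of partitioned softmax attention, not BurstAttention's actual kernels or communication schedule. The function names `local_attention` and `merge` and the toy data are assumptions introduced here, and the "devices" are simulated with plain Python lists.

```python
import math


def local_attention(q, keys, values):
    """Attention of one query over one shard of keys/values.

    Returns (out, lse): the shard-local softmax-weighted output and the
    log-sum-exp of the shard's scaled scores. Hypothetical helper.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) / total
           for j in range(dim)]
    return out, m + math.log(total)


def merge(partials):
    """Combine shard-local (out, lse) pairs into the global attention output."""
    m = max(lse for _, lse in partials)
    global_lse = m + math.log(sum(math.exp(lse - m) for _, lse in partials))
    dim = len(partials[0][0])
    merged = [0.0] * dim
    for out, lse in partials:
        # Each shard's share of the global softmax normalizer.
        w = math.exp(lse - global_lse)
        for j in range(dim):
            merged[j] += w * out[j]
    return merged, global_lse


# Toy data (assumed for illustration): one query, four key/value pairs.
q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

# Reference: attention over the full sequence on a single "device".
full, _ = local_attention(q, keys, values)

# Simulated two-device split: each shard only sees half of the keys/values.
shards = [
    local_attention(q, keys[:2], values[:2]),
    local_attention(q, keys[2:], values[2:]),
]
merged, _ = merge(shards)
```

In this sketch each shard only has to hand over its partial output and one scalar per query, which is the kind of local result whose storage and aggregation the abstract identifies as the source of memory and communication overhead.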

Code & Models

Repositories

MayDomine/Burst-Attention (PyTorch, official)

Taxonomy

Topics: Algorithms and Data Compression · Advanced Data Compression Techniques · Tensor decomposition and applications