Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters
Vasudev Shyam, Jonathan Pilault, Emily Shepperd, Quentin Anthony, Beren Millidge

TL;DR
Tree Attention introduces a topology-aware decoding algorithm that leverages tree reduction for efficient, parallel exact attention computation across multiple GPUs, significantly improving speed and reducing memory and communication costs.
Contribution
The paper presents Tree Attention, a parallel exact-attention algorithm that computes the reduction across the sequence axis as a tree reduction, making long-context decoding on GPU clusters faster than approaches such as Ring Attention.
Findings
Achieves up to 8x faster decoding than Ring Attention.
Requires significantly less communication volume and 2x less peak memory than approaches such as Ring Attention.
Speeds up decoding by up to 4x on Llama 3.1-8B.
Abstract
Our formulation reveals that the reduction across the sequence axis can be efficiently computed in parallel through a tree reduction. Our algorithm, called Tree Attention, for parallelizing exact attention computation across multiple GPUs enables cross-device decoding to be performed asymptotically faster (up to 8x faster in our experiments) than state-of-the-art approaches such as Ring Attention, while also requiring significantly less communication volume and incurring 2x less peak memory. We demonstrate that Tree Attention speeds up decoding up to 4x on Llama 3.1-8B and can be applied to a variety of hardware and networking setups such as H100 DGX nodes, AMD MI300x nodes, and PCIe connected NVIDIA RTX 4090s. Our code is publicly available here: https://github.com/Zyphra/tree_attention
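The tree reduction described in the abstract can be illustrated with a small single-process sketch. It is not the authors' multi-GPU implementation: the function names (`partial_attention`, `combine`, `tree_reduce`, `tree_attention_decode`) are hypothetical, "shards" are plain Python list slices standing in for devices, and the usual 1/sqrt(d) score scaling is omitted for brevity. The point it tries to show is that each shard's softmax statistics (running max, denominator, weighted-value numerator) can be merged with an associative operation, so the merge across shards can run pairwise in roughly log2(p) rounds rather than sequentially.

```python
import math
import random


def partial_attention(q, keys, values):
    """Per-shard statistics for one query: (max score, denominator, numerator)."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]  # shifted for numerical stability
    denom = sum(weights)
    dim = len(values[0])
    numer = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return m, denom, numer


def combine(a, b):
    """Associative merge of two partial results, rescaled to a shared max."""
    m_a, d_a, n_a = a
    m_b, d_b, n_b = b
    m = max(m_a, m_b)
    s_a, s_b = math.exp(m_a - m), math.exp(m_b - m)
    return m, d_a * s_a + d_b * s_b, [x * s_a + y * s_b for x, y in zip(n_a, n_b)]


def tree_reduce(parts):
    """Pairwise reduction; returns the merged result and the number of rounds."""
    rounds = 0
    while len(parts) > 1:
        merged = [combine(parts[i], parts[i + 1]) for i in range(0, len(parts) - 1, 2)]
        if len(parts) % 2:  # an unpaired shard carries over to the next round
            merged.append(parts[-1])
        parts = merged
        rounds += 1
    return parts[0], rounds


def tree_attention_decode(q, keys, values, num_shards):
    """Split keys/values into shards (stand-ins for devices), then tree-reduce."""
    size = math.ceil(len(keys) / num_shards)
    parts = [
        partial_attention(q, keys[i:i + size], values[i:i + size])
        for i in range(0, len(keys), size)
    ]
    (_, denom, numer), rounds = tree_reduce(parts)
    return [x / denom for x in numer], rounds


# Compare against unsharded attention on random data (assumed toy sizes).
random.seed(0)
d, n = 4, 64
q = [random.gauss(0, 1) for _ in range(d)]
keys = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
values = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

out, rounds = tree_attention_decode(q, keys, values, num_shards=8)
_, ref_denom, ref_numer = partial_attention(q, keys, values)
ref = [x / ref_denom for x in ref_numer]
print(rounds, max(abs(a - b) for a, b in zip(out, ref)))
```

With 8 shards the merge finishes in 3 rounds, and the result matches unsharded attention up to floating-point error, which is what "exact attention" refers to. On a real cluster the pairwise merges would be carried out by a collective communication primitive across devices; that part is outside the scope of this sketch.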
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Advanced Data Compression Techniques
Methods: Softmax · Attention Is All You Need
