Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters

Vasudev Shyam; Jonathan Pilault; Emily Shepperd; Quentin Anthony; Beren Millidge

arXiv:2408.04093·cs.LG·February 11, 2025

TL;DR

Tree Attention is a topology-aware decoding algorithm that uses a tree reduction across the sequence axis to parallelize exact attention across multiple GPUs. In the authors' experiments it decodes up to 8x faster than Ring Attention while requiring less communication volume and 2x less peak memory.

Contribution

The paper presents Tree Attention, an algorithm for parallelizing exact attention computation across multiple GPUs that makes long-context, cross-device decoding asymptotically faster than state-of-the-art approaches such as Ring Attention.

Findings

01 Decodes up to 8x faster than Ring Attention in the authors' experiments.

02 Requires significantly less communication volume than Ring Attention and incurs 2x less peak memory.

03 Speeds up decoding by up to 4x on Llama 3.1-8B.

Abstract

Our formulation reveals that the reduction across the sequence axis can be efficiently computed in parallel through a tree reduction. Our algorithm, called Tree Attention, for parallelizing exact attention computation across multiple GPUs enables cross-device decoding to be performed asymptotically faster (up to 8x faster in our experiments) than state-of-the-art approaches such as Ring Attention, while also requiring significantly less communication volume and incurring 2x less peak memory. We demonstrate that Tree Attention speeds up decoding up to 4x on Llama 3.1-8B and can be applied to a variety of hardware and networking setups such as H100 DGX nodes, AMD MI300x nodes, and PCIe connected NVIDIA RTX 4090s. Our code is publicly available here: https://github.com/Zyphra/tree_attention
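To illustrate why the reduction across the sequence axis admits a tree reduction, here is a minimal single-process sketch in plain Python. It is not the authors' implementation (their repository uses JAX and real multi-GPU collectives). It assumes a single query vector, omits the usual 1/sqrt(d) logit scaling, and simulates devices as chunks of the key/value cache. All function names (`partial_attention`, `combine`, `tree_reduce`, `tree_decode`, `reference_decode`) are hypothetical. The idea shown is that each chunk produces a small summary (running max, sum of exponentials, weighted value sum), and summaries merge with an associative operation, so they can be combined pairwise in a logarithmic number of rounds.

```python
import math


def partial_attention(q, keys, values):
    """Summarize one chunk of the KV cache for a single query q.

    Returns (m, s, n): the max logit, the sum of exp(logit - m), and the
    exp-weighted sum of value vectors (the unnormalized numerator).
    """
    logits = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    s = sum(weights)
    dim = len(values[0])
    n = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return m, s, n


def combine(a, b):
    """Associative merge of two chunk summaries, rescaled to a shared max."""
    m_a, s_a, n_a = a
    m_b, s_b, n_b = b
    m = max(m_a, m_b)
    c_a, c_b = math.exp(m_a - m), math.exp(m_b - m)
    merged_n = [c_a * x + c_b * y for x, y in zip(n_a, n_b)]
    return m, c_a * s_a + c_b * s_b, merged_n


def tree_reduce(parts):
    """Merge summaries pairwise; return the result and the number of rounds."""
    rounds = 0
    while len(parts) > 1:
        merged = [combine(parts[i], parts[i + 1])
                  for i in range(0, len(parts) - 1, 2)]
        if len(parts) % 2:  # an unpaired summary carries over to the next round
            merged.append(parts[-1])
        parts = merged
        rounds += 1
    return parts[0], rounds


def tree_decode(q, keys, values, num_chunks):
    """Exact attention output for q, computed chunk-wise then tree-reduced."""
    size = math.ceil(len(keys) / num_chunks)
    parts = [partial_attention(q, keys[i:i + size], values[i:i + size])
             for i in range(0, len(keys), size)]
    (_, s, n), rounds = tree_reduce(parts)
    return [x / s for x in n], rounds


def reference_decode(q, keys, values):
    """Ordinary softmax attention over the whole sequence, for comparison."""
    _, s, n = partial_attention(q, keys, values)
    return [x / s for x in n]
```

With 16 keys split into 8 simulated chunks, `tree_decode` needs 3 merge rounds and returns the same output as `reference_decode` up to floating-point error. A sequential pass over the same 8 chunks would need 7 dependent merge steps, which gives some intuition for the asymptotic speedup the abstract claims.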

Code & Models

Repositories

zyphra/tree_attention (JAX, official)

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Advanced Data Compression Techniques

Methods: Softmax · Attention Is All You Need