Mesh-Attention: A New Communication-Efficient Distributed Attention with Improved Data Locality
Sirui Chen, Jingji Chen, Siqi Zhu, Ziheng Jiang, Yanghua Peng, Xuehai Qian

TL;DR
Mesh-Attention is a distributed attention algorithm that assigns each GPU a 2D tile of computation blocks, reducing communication overhead and improving scalability for large language models on multi-GPU systems.
Contribution
The paper presents Mesh-Attention, a new matrix-based distributed attention method with a 2D tiling approach, outperforming Ring-Attention in efficiency and scalability.
Findings
Achieves up to 3.4x speedup over existing methods.
Reduces communication volume by up to 85.4%.
Maintains high performance as the system scales to 256 GPUs.
Abstract
Distributed attention is a fundamental problem for scaling the context window of Large Language Models (LLMs). The state-of-the-art method, Ring-Attention, suffers from scalability limitations due to its excessive communication traffic. This paper proposes a new distributed attention algorithm, Mesh-Attention, by rethinking the design space of distributed attention with a new matrix-based model. Our method assigns a two-dimensional tile -- rather than a one-dimensional row or column -- of computation blocks to each GPU to achieve higher efficiency through a lower communication-computation (CommCom) ratio. The general approach covers Ring-Attention as a special case, and allows the tuning of the CommCom ratio with different tile shapes. Importantly, we propose a greedy algorithm that can efficiently search the scheduling space within the tile with restrictions that ensure efficient communication…
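To make the tile-shape argument concrete, here is a minimal Python sketch of how the CommCom ratio might vary with tile shape. It is not the paper's cost model. It assumes a simplified proxy: with n GPUs, the work is an n x n grid of computation blocks, each GPU gets a rows x cols tile with rows * cols = n, the tile touches `rows` query blocks and `cols` key/value blocks, and the GPU already holds one of each locally. Output reduction traffic and the size difference between Q and KV blocks are ignored. The function names `tile_shapes`, `remote_blocks`, and `commcom_ratio` are hypothetical.

```python
def tile_shapes(n):
    """Return every (rows, cols) tile shape with rows * cols == n."""
    return [(a, n // a) for a in range(1, n + 1) if n % a == 0]


def remote_blocks(rows, cols):
    """Input blocks a GPU must receive for one tile (simplified proxy).

    Assumption: the tile needs `rows` Q blocks and `cols` KV blocks,
    and one Q block and one KV block are already local.
    """
    return (rows - 1) + (cols - 1)


def commcom_ratio(rows, cols):
    """Received blocks per computation block under the proxy above."""
    return remote_blocks(rows, cols) / (rows * cols)


if __name__ == "__main__":
    n = 16  # hypothetical GPU count
    for rows, cols in tile_shapes(n):
        print(f"{rows:>2} x {cols:<2}  ratio = {commcom_ratio(rows, cols):.4f}")
```

Under this proxy, the 1 x n shape corresponds to the one-dimensional row assignment that the abstract identifies as the Ring-Attention special case, and squarer tiles receive fewer blocks per unit of computation. The real trade-off also depends on costs the sketch leaves out.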
Taxonomy
Topics: Advanced Neural Network Applications · Big Data and Digital Economy · IoT and Edge/Fog Computing
