FlatAttention: Dataflow and Fabric Collectives Co-Optimization for Efficient Multi-Head Attention on Tile-Based Many-PE Accelerators

Chi Zhang; Luca Colagrande; Renzo Andri; Thomas Benz; Gamze Islamoglu; Alessandro Nadalini; Francesco Conti; Yawei Li; Luca Benini

arXiv:2505.18824·cs.AR·May 27, 2025

TL;DR

FlatAttention is a dataflow for multi-head attention on tile-based accelerators that reaches up to 89.3% utilization and a 4.1x speedup over the FlashAttention-3 dataflow while cutting HBM traffic by 16x; the authors also report lower HBM bandwidth requirements and die size than GPU-based solutions.

Contribution

The paper proposes FlatAttention, a new dataflow that co-optimizes data movement and on-chip fabric collective primitives for efficient MHA on tile-based many-PE accelerators.

Findings

01. Achieves up to 89.3% utilization of processing elements.

02. Delivers a 4.1x speedup over the FlashAttention-3 dataflow on tile-based accelerators.

03. Reduces HBM traffic by 16x relative to the FlashAttention-3 dataflow and requires 40% less HBM bandwidth than the Nvidia H100.

Abstract

Multi-Head Attention (MHA) is a critical computational kernel in transformer-based AI models. Emerging scalable tile-based accelerator architectures integrate increasing numbers of tightly-packed processing elements (PEs) with tensor units. MHA dataflow mapping is crucial for achieving high utilization of the available units. We propose FlatAttention, a new dataflow for MHA on tile-based many-PE accelerators, minimizing costly main memory (HBM) accesses by leveraging collective primitives integrated into the on-chip network fabric. FlatAttention achieves up to 89.3% utilization, and 4.1x performance speedup over FlashAttention-3 dataflow on tile-based accelerators whilst reducing HBM traffic by 16x. Through algorithm-architecture co-exploration, we identify an optimal configuration for a large scaled-out tile-based accelerator featuring a 32x32 tile mesh with 1024 TFLOPS @ FP16 peak…
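
To make the link between sharing data on chip and HBM traffic concrete, here is a minimal sketch of a simplified traffic model. It is not the paper's cost model: it assumes a generic blocked attention in which each block of query rows streams all of K and V from HBM, and it hypothetically treats a group of cooperating tiles as one larger query block whose K/V fetches are shared over the fabric. The function names, the `group` parameter, and the example sizes are all illustrative assumptions.

```python
def hbm_bytes_per_head(seq_len, head_dim, q_block, bytes_per_elem=2):
    """HBM bytes moved for one attention head under a simple blocked model.

    Assumption: each block of `q_block` query rows is read once, streams
    all of K and V from HBM, and writes its output block once.
    """
    if seq_len % q_block != 0:
        raise ValueError("seq_len must be a multiple of q_block")
    n_q_blocks = seq_len // q_block
    q_and_o = 2 * seq_len * head_dim               # Q read once, O written once
    k_and_v = n_q_blocks * 2 * seq_len * head_dim  # K and V re-read per Q block
    return (q_and_o + k_and_v) * bytes_per_elem


def traffic_reduction(seq_len, head_dim, q_block, group):
    """Traffic ratio: independent tiles vs. `group` tiles sharing K/V fetches.

    Hypothetical model: sharing is represented as a query block that is
    `group` times larger, so K and V are fetched `group` times less often.
    """
    independent = hbm_bytes_per_head(seq_len, head_dim, q_block)
    shared = hbm_bytes_per_head(seq_len, head_dim, q_block * group)
    return independent / shared


if __name__ == "__main__":
    # Illustrative sizes only; they are not taken from the paper.
    print(traffic_reduction(seq_len=4096, head_dim=128, q_block=128, group=16))
```

In this toy model the reduction approaches `group` as the sequence grows, because the repeated K/V reads dominate the one-time Q and O traffic. The figures reported in the abstract come from the authors' own evaluation, not from this sketch.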

Taxonomy

Topics: Advanced Memory and Neural Computing · CCD and CMOS Imaging Sensors · Quantum-Dot Cellular Automata