FlatAttention: Dataflow and Fabric Collectives Co-Optimization for Large Attention-Based Model Inference on Tile-Based Accelerators

Chi Zhang; Luca Colagrande; Renzo Andri; Luca Benini

arXiv:2604.02110 · cs.AR · April 3, 2026

TL;DR

FlatAttention is a dataflow for modern attention variants on tile-based accelerators. It uses collective primitives built into the on-chip network fabric to cut HBM accesses, reaching up to 92.3% utilization and a 4.1x speedup over FlashAttention-3.

Contribution

It presents a dataflow for attention variants on tile-based accelerators that reduces HBM traffic and raises utilization, outperforming FlashAttention-3 by up to 4.1x.

Findings

01

Achieves up to 92.3% utilization and 4.1x speedup over FlashAttention-3.

02

Reduces HBM traffic by 16x and generalizes across multiple attention variants.

03

Improves end-to-end system throughput by 1.9x and reduces latency by 1.4x on wafer-scale systems.

Abstract

Attention accounts for an increasingly dominant fraction of total computation during inference for mixture-of-experts (MoE) models, making efficient acceleration critical. Emerging domain-specific accelerators for large model inference are shifting toward chip-scale and wafer-scale tile-based architectures. Tiles contain large matrix and vector engines and are connected through on-chip interconnects, which support tile-to-tile traffic to reduce the tile-to-main-memory traffic bottleneck. Hence, dataflow management is crucial to achieve high utilization. We propose FlatAttention, a dataflow for modern attention variants on tile-based accelerators. FlatAttention minimizes expensive high-bandwidth memory (HBM) accesses by exploiting collective primitives integrated into the on-chip network fabric, achieving up to 92.3% utilization, 4.1x speedup over FlashAttention-3, and 16x lower HBM…
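To make the intuition behind trading HBM traffic for tile-to-tile traffic concrete, here is a minimal back-of-the-envelope sketch. It is not the paper's cost model: the function `hbm_bytes`, its parameters, and the assumption that a group of tiles fetches each K/V block from HBM once and shares it on chip (for example via multicast) are all hypothetical and chosen only for illustration.

```python
from math import ceil


def hbm_bytes(seq_len, head_dim, block_rows, group_size, bytes_per_elem=2):
    """Toy estimate of HBM traffic in bytes for one attention head.

    Assumptions (illustrative, not taken from the paper):
    - each tile owns one block of `block_rows` query rows;
    - computing a query block requires streaming all of K and V;
    - tiles in a group of `group_size` share a single HBM fetch of K and V
      over the on-chip fabric instead of each fetching their own copy.
    """
    q_blocks = ceil(seq_len / block_rows)
    # Number of times the full K and V matrices are read from HBM.
    kv_fetches = ceil(q_blocks / group_size)
    kv_bytes = kv_fetches * 2 * seq_len * head_dim * bytes_per_elem
    # Q is read once and O is written once, regardless of grouping.
    qo_bytes = 2 * seq_len * head_dim * bytes_per_elem
    return kv_bytes + qo_bytes


# Hypothetical sizes: 4096 tokens, head dimension 128, 128-row query blocks.
baseline = hbm_bytes(4096, 128, 128, group_size=1)   # every tile fetches K/V itself
shared = hbm_bytes(4096, 128, 128, group_size=16)    # 16 tiles share each K/V fetch
print(f"traffic ratio: {baseline / shared:.1f}x")
```

Under these assumptions the K/V term shrinks with the group size while the Q/O term stays fixed, so the overall reduction is smaller than the group size. The figures reported for FlatAttention come from the authors' own evaluation, and this toy model does not attempt to reproduce them.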
