FlatAttention: Dataflow and Fabric Collectives Co-Optimization for Large Attention-Based Model Inference on Tile-Based Accelerators
Chi Zhang, Luca Colagrande, Renzo Andri, Luca Benini

TL;DR
FlatAttention is a dataflow for attention on tile-based accelerators that uses collective primitives in the on-chip network fabric to cut HBM accesses, reaching up to 92.3% utilization and a 4.1x speedup over FlashAttention-3.
Contribution
The paper presents a dataflow for modern attention variants on tile-based accelerators that reduces HBM accesses and raises utilization by exploiting collective primitives integrated into the on-chip network fabric.
Findings
Achieves up to 92.3% utilization and 4.1x speedup over FlashAttention-3.
Reduces HBM traffic by up to 16x and generalizes across multiple attention variants.
Improves end-to-end system throughput by 1.9x and reduces latency by 1.4x on wafer-scale systems.
Abstract
Attention accounts for an increasingly dominant fraction of total computation during inference for mixture-of-experts (MoE) models, making efficient acceleration critical. Emerging domain-specific accelerators for large model inference are shifting toward chip-scale and wafer-scale tile-based architectures. Tiles contain large matrix and vector engines and are connected through on-chip interconnects, which support tile-to-tile traffic to reduce the tile-to-main-memory traffic bottleneck. Hence, dataflow management is crucial to achieve high utilization. We propose FlatAttention, a dataflow for modern attention variants on tile-based accelerators. FlatAttention minimizes expensive high-bandwidth memory (HBM) accesses by exploiting collective primitives integrated into the on-chip network fabric, achieving up to 92.3% utilization, 4.1x speedup over FlashAttention-3, and 16x lower HBM…
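The idea of replacing main-memory traffic with tile-to-tile collectives can be illustrated with a small sketch. The code below is not the paper's dataflow. It is a minimal pure-Python illustration, under stated assumptions, of why a softmax attention row can be split across a group of tiles and merged with a reduction. The function names (`tile_partial`, `collective_combine`, `grouped_attention`) and the sharding of keys and values by contiguous chunks are hypothetical choices made for this example.

```python
import math


def reference_attention(q, keys, values):
    """Single-query softmax attention computed in one place (baseline)."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def tile_partial(q, keys, values):
    """Local work on one hypothetical tile holding a shard of K and V.

    Returns (local max score, local sum of exponentials, unnormalized output).
    """
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return m, sum(weights), out


def collective_combine(partials):
    """Stand-in for a fabric reduction (assumption, not the paper's primitive).

    Merges per-tile partial results using only the small (max, sum, output)
    triples, so no tile has to re-read another tile's K/V shard.
    """
    global_max = max(m for m, _, _ in partials)
    dim = len(partials[0][2])
    total = 0.0
    out = [0.0] * dim
    for m, s, o in partials:
        scale = math.exp(m - global_max)  # rescale to the shared maximum
        total += s * scale
        for d in range(dim):
            out[d] += o[d] * scale
    return [x / total for x in out]


def grouped_attention(q, keys, values, num_tiles):
    """Shard K/V into contiguous chunks across tiles, then combine."""
    chunk = math.ceil(len(keys) / num_tiles)
    partials = [tile_partial(q, keys[i:i + chunk], values[i:i + chunk])
                for i in range(0, len(keys), chunk)]
    return collective_combine(partials)
```

In this sketch each tile reads its shard of keys and values once, and only the per-tile maximum, sum, and partial output cross between tiles. That is the general property that lets a reduction over an interconnect stand in for repeated main-memory reads. How FlatAttention actually maps work to tiles and which collectives it uses is described in the paper, not here.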
