SystolicAttention: Fusing FlashAttention within a Single Systolic Array
Jiawei Lin, Yuanlong Li, Guokai Chen, Thomas Bourgeat

TL;DR
This paper introduces FSA, an enhanced systolic array architecture that runs the entire FlashAttention computation inside a single array, without external vector units, improving hardware utilization for the attention operation in transformer models.
Contribution
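The paper proposes FSA, an enhanced systolic array design that executes the entire FlashAttention operation on the array itself, together with SystolicAttention, a kernel that overlaps FlashAttention's operations element by element while preserving the original floating-point operation order. FSA is implemented in synthesizable RTL and evaluated against commercial accelerators.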
Findings
FSA achieves 1.77x higher FLOPs/s utilization than AWS Neuron-v2.
FSA achieves 4.83x higher FLOPs/s utilization than Google TPUv5e.
FSA has only 12% area overhead in 16 nm technology.
Abstract
Transformer models rely heavily on the scaled dot-product attention (SDPA) operation, typically implemented as FlashAttention. Characterized by its frequent interleaving of matrix multiplications and softmax operations, FlashAttention fails to fully utilize the compute resources of modern systolic-array-based accelerators designed for consecutive and large matrix multiplications. To fully unleash the performance potential of systolic arrays for FlashAttention, we propose FSA, an enhanced systolic array architecture that runs the entire FlashAttention on the array without external vector units. Combined with SystolicAttention, an optimized kernel for FSA that achieves fine-grained and element-wise overlapping of FlashAttention operations, FSA maximizes array utilization while preserving the original floating-point operation order of FlashAttention. We implement FSA in synthesizable RTL…
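The interleaving the abstract describes can be illustrated in software. The following is a minimal pure-Python sketch of a tiled attention forward pass with an online softmax, in the style of FlashAttention. It is not the FSA hardware or the SystolicAttention kernel. The function names, the `block` parameter, and the row-by-row loop structure are assumptions made for illustration only.

```python
import math

def naive_attention(Q, K, V):
    """Reference SDPA: softmax(Q K^T / sqrt(d)) V, computed row by row."""
    d = len(Q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        denom = sum(exps)
        out.append([sum(e * v[j] for e, v in zip(exps, V)) / denom
                    for j in range(len(V[0]))])
    return out

def flash_attention(Q, K, V, block=2):
    """Tiled forward pass with an online softmax (illustrative sketch)."""
    d = len(Q[0])
    dv = len(V[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in Q:
        m = -math.inf      # running row maximum
        l = 0.0            # running softmax denominator
        acc = [0.0] * dv   # running unnormalized output row
        for start in range(0, len(K), block):
            Kb = K[start:start + block]
            Vb = V[start:start + block]
            # Matmul 1: scaled scores of this query against one key block.
            s = [scale * sum(a * b for a, b in zip(q, k)) for k in Kb]
            # Softmax step: update the maximum and rescale earlier state.
            m_new = max(m, max(s))
            alpha = math.exp(m - m_new)  # exp(-inf) is 0.0 on the first block
            p = [math.exp(x - m_new) for x in s]
            l = alpha * l + sum(p)
            # Matmul 2: accumulate the block's contribution P @ V.
            acc = [alpha * acc[j] + sum(pi * v[j] for pi, v in zip(p, Vb))
                   for j in range(dv)]
            m = m_new
        out.append([a / l for a in acc])
    return out
```

Each key block triggers a small matrix multiplication, then softmax bookkeeping (max, exponent, rescale), then a second small matrix multiplication. On an accelerator that pairs a systolic array with separate vector units, those softmax steps fall between the matrix multiplications, which is the utilization problem the abstract points to.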
Taxonomy
Topics: Advanced Data Storage Technologies
Methods: Softmax
