VFA: Relieving Vector Operations in Flash Attention with Global Maximum Pre-computation
Yupeng Sun, Yanzhao Li, Zhiqiang Zou, Bai Du, Zhiyuan Zhang, Hui Dong, Gaoyige Fan, Hui Wang

TL;DR
VFA is a hardware-friendly method for optimizing online softmax in FlashAttention. It reduces vector/SIMD bottlenecks in the non-matmul parts of the kernel and achieves significant speedups on modern accelerators.
Contribution
The paper proposes Vector Relieved Flash Attention (VFA), a technique that reduces rowmax-driven updates of the running maximum in online softmax and integrates with sparse methods to improve efficiency.
Findings
VFA stabilizes the running maximum early in the attention computation.
VFA and VSA avoid conditional rescale operations, reducing latency.
VFA achieves speedups of nearly 2x over the baseline on modern hardware.
Abstract
FlashAttention-style online softmax enables exact attention computation with linear memory by streaming score tiles through on-chip memory and maintaining a running maximum and normalizer. However, as attention kernels approach peak tensor-core/cube-core throughput on modern accelerators, non-matmul components of online softmax -- especially per-tile rowmax and rowsum reductions and rescale chains -- can become vector or SIMD limited and dominate latency. This paper revisits FlashAttention and proposes Vector Relieved Flash Attention (VFA), a hardware-friendly method that reduces rowmax-driven updates of the running maximum while retaining the online-softmax structure. VFA initializes the running maximum via a cheap approximation from key-block representations, reorders key-block traversal to prioritize high-impact sink and local blocks, and freezes the maximum for remaining blocks to…
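The running-maximum and normalizer bookkeeping described in the abstract can be illustrated with a minimal pure-Python sketch for a single query row. It is not the paper's kernel. The function names, the `tile` size, and the `frozen_max` parameter are hypothetical and chosen only for illustration. How VFA actually obtains its initial maximum (the abstract mentions a cheap approximation from key-block representations) is not modeled here; the sketch simply accepts a fixed value and shows that, once the maximum is held constant, the per-tile rowmax and rescale steps disappear.

```python
import math

def reference_attention(scores, values):
    """Two-pass softmax-weighted sum for one query row."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def online_attention(scores, values, tile=2, frozen_max=None):
    """Stream score tiles with a running maximum m, normalizer l, and output acc.

    frozen_max is a hypothetical parameter: when given, m is never updated,
    so the per-tile rowmax reduction and the rescale step are skipped.
    Returns the output row and the number of rescales performed.
    """
    dim = len(values[0])
    m = -math.inf if frozen_max is None else frozen_max
    l = 0.0
    acc = [0.0] * dim
    rescales = 0
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        if frozen_max is None:
            m_new = max(m, max(s_tile))      # per-tile rowmax reduction
            if m_new > m:
                scale = math.exp(m - m_new)  # exp(-inf) == 0.0 on the first tile
                l *= scale                   # rescale the normalizer
                acc = [a * scale for a in acc]  # rescale the partial output
                rescales += 1
                m = m_new
        for s, v in zip(s_tile, v_tile):
            p = math.exp(s - m)
            l += p
            for d in range(dim):
                acc[d] += p * v[d]
    return [a / l for a in acc], rescales

# Toy data, assumed for illustration only.
scores = [0.5, 2.0, -1.0, 3.0, 1.5, 0.0]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [0.0, 2.0], [1.0, -1.0]]

expected = reference_attention(scores, values)
streamed, n_rescales = online_attention(scores, values)
frozen, n_rescales_frozen = online_attention(scores, values, frozen_max=3.0)
```

Both variants agree with the two-pass reference because softmax is invariant to a constant shift of the scores: any fixed `m` cancels between `acc` and `l`. In finite precision the fixed value still needs to sit near the true maximum, otherwise `exp(s - m)` can overflow or underflow, which is why the quality of the initial estimate matters.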
