VFA: Relieving Vector Operations in Flash Attention with Global Maximum Pre-computation

Yupeng Sun; Yanzhao Li; Zhiqiang Zou; Bai Du; Zhiyuan Zhang; Hui Dong; Gaoyige Fan; Hui Wang

arXiv:2604.12798 · cs.LG · April 15, 2026

TL;DR

VFA is a hardware-friendly modification of the online softmax in FlashAttention. It reduces the vector-unit bottleneck caused by rowmax reductions and rescaling, and reaches speedups of nearly two times over the baseline on modern accelerators.

Contribution

The paper proposes Vector Relieved Flash Attention (VFA), a technique that stabilizes the running maximum early in the computation and integrates with sparse methods to improve efficiency.

Findings

01. VFA stabilizes the running maximum early in the attention computation.

02. VFA and VSA avoid conditional rescale operations, reducing latency.

03. VFA achieves speedups of nearly two times over the baseline on modern hardware.

Abstract

FlashAttention-style online softmax enables exact attention computation with linear memory by streaming score tiles through on-chip memory and maintaining a running maximum and normalizer. However, as attention kernels approach peak tensor-core/cube-core throughput on modern accelerators, non-matmul components of online softmax -- especially per-tile rowmax and rowsum reductions and rescale chains -- can become vector or SIMD limited and dominate latency. This paper revisits FlashAttention and proposes Vector Relieved Flash Attention (VFA), a hardware-friendly method that reduces rowmax-driven updates of the running maximum while retaining the online-softmax structure. VFA initializes the running maximum via a cheap approximation from key-block representations, reorders key-block traversal to prioritize high-impact sink and local blocks, and freezes the maximum for remaining blocks to…
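
To make the running maximum, normalizer, and rescale chain concrete, here is a minimal pure-Python sketch for a single query row. It is not the paper's kernel: the function names, the tile size, the scalar values, and the hand-picked `m_fixed` are all assumptions made for illustration. The abstract above is truncated, so the sketch does not attempt to reproduce VFA's maximum initialization or its block reordering. It only illustrates the standard fact that softmax is invariant to the reference value subtracted before exponentiation, so a fixed reference that is close enough to the true maximum lets the per-tile rowmax and rescale steps be skipped.

```python
import math

def softmax_weighted_sum(scores, values):
    """Reference result: softmax(scores) dotted with values, one query row."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def online_attention_row(scores, values, tile=4):
    """FlashAttention-style streaming pass with running max and normalizer."""
    m = float("-inf")  # running maximum
    l = 0.0            # running normalizer (sum of exponentials)
    acc = 0.0          # unnormalized output accumulator
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        m_new = max(m, max(s_tile))      # per-tile rowmax reduction
        scale = math.exp(m - m_new)      # rescale factor for the old state
        p = [math.exp(s - m_new) for s in s_tile]
        l = l * scale + sum(p)           # per-tile rowsum reduction
        acc = acc * scale + sum(pi * vi for pi, vi in zip(p, v_tile))
        m = m_new
    return acc / l

def frozen_max_attention_row(scores, values, m_fixed, tile=4):
    """Same pass with a fixed reference value (assumed given): no rowmax,
    no rescale. Mathematically exact for any m_fixed; numerically safe only
    when m_fixed is close enough to the true maximum to avoid overflow or
    underflow in exp."""
    l = 0.0
    acc = 0.0
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        p = [math.exp(s - m_fixed) for s in s_tile]
        l += sum(p)
        acc += sum(pi * vi for pi, vi in zip(p, v_tile))
    return acc / l

# Hypothetical data: nine keys with scalar values for brevity.
scores = [0.5, 2.0, -1.0, 3.5, 0.0, 1.2, -0.7, 2.8, 0.3]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]

reference = softmax_weighted_sum(scores, values)
streamed = online_attention_row(scores, values)
frozen = frozen_max_attention_row(scores, values, m_fixed=3.0)
```

In this sketch all three results agree up to floating-point rounding, even though `m_fixed=3.0` is below the true maximum of 3.5. The frozen variant removes two vector operations per tile, which is the kind of work the abstract identifies as a bottleneck once the matmul units are saturated.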
