TL;DR
QFlash is an end-to-end integer FlashAttention design for vision transformers that computes softmax entirely in the integer domain. By addressing three quantization obstacles, it delivers faster, more energy-efficient attention without accuracy loss.
Contribution
It presents the first fully integer implementation of FlashAttention, addressing quantization obstacles and achieving significant speedups and energy savings.
Findings
Up to 6.73× speedup over I-ViT
Up to 8.69× speedup on Swin models
18.8% energy reduction compared to FP16 FlashAttention
Abstract
FlashAttention improves efficiency through tiling, but its online softmax still relies on floating-point arithmetic for numerical stability, making full quantization difficult. We identify three main obstacles to integer-only FlashAttention: (1) scale explosion during tile-wise accumulation, (2) inefficient shift-based exponential operations on GPUs, and (3) quantization granularity constraints requiring uniform scales for integer comparison. To address these challenges, we propose QFlash, an end-to-end integer FlashAttention design that performs softmax entirely in the integer domain and runs as a single Triton kernel. On seven attention workloads from ViT, DeiT, and Swin models, QFlash achieves up to 6.73× speedup over I-ViT and up to 8.69× speedup on Swin, while reducing energy consumption by 18.8% compared to FP16 FlashAttention, without sacrificing Top-1…
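For readers unfamiliar with the online softmax mentioned in the abstract, the sketch below shows the standard floating-point, tile-wise formulation in plain Python. It is not QFlash's integer method, which the abstract does not specify; the function names and the `tile_size` parameter are illustrative assumptions. It only shows the running-max and rescale steps that tile-wise accumulation relies on.

```python
import math


def online_softmax(scores, tile_size=4):
    """Tile-wise (online) softmax over a 1-D list of floats.

    Illustrative floating-point sketch only; not the QFlash integer kernel.
    """
    running_max = float("-inf")  # largest score seen so far
    running_sum = 0.0            # sum of exp(score - running_max) so far

    for start in range(0, len(scores), tile_size):
        tile = scores[start:start + tile_size]
        new_max = max(running_max, max(tile))
        # Rescale the previously accumulated sum to the new maximum.
        # On the first tile exp(-inf) is 0.0, so the empty sum stays 0.0.
        running_sum *= math.exp(running_max - new_max)
        # Accumulate this tile's contribution relative to the new maximum.
        running_sum += sum(math.exp(s - new_max) for s in tile)
        running_max = new_max

    return [math.exp(s - running_max) / running_sum for s in scores]


def reference_softmax(scores):
    """Single-pass softmax over the full row, used as a check."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

The tiled version yields the same probabilities as the single-pass reference up to rounding. The per-tile rescale by `exp(running_max - new_max)` is the accumulation step that an integer-only design would need to reproduce without floating-point arithmetic.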
