QFlash: Bridging Quantization and Memory Efficiency in Vision Transformer Attention

Sehyeon Oh; Yongin Kwon; Jemin Lee

arXiv:2604.25306·cs.LG·April 29, 2026

TL;DR

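QFlash performs the FlashAttention softmax for vision transformer attention entirely in integer arithmetic, making attention faster and more energy-efficient without accuracy loss. It does so by addressing three quantization obstacles: scale explosion during tile-wise accumulation, inefficient shift-based exponentials on GPUs, and the uniform-scale requirement for integer comparison.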

Contribution

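The paper presents the first fully integer implementation of FlashAttention, running as a single Triton kernel. It addresses the quantization obstacles of the online softmax and achieves up to 6.73× speedup over I-ViT and an 18.8% energy reduction compared to FP16 FlashAttention.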

Findings

01 Up to 6.73× speedup over I-ViT

02 Up to 8.69× speedup on Swin models

03 18.8% energy reduction compared to FP16 FlashAttention

Abstract

FlashAttention improves efficiency through tiling, but its online softmax still relies on floating-point arithmetic for numerical stability, making full quantization difficult. We identify three main obstacles to integer-only FlashAttention: (1) scale explosion during tile-wise accumulation, (2) inefficient shift-based exponential operations on GPUs, and (3) quantization granularity constraints requiring uniform scales for integer comparison. To address these challenges, we propose QFlash, an end-to-end integer FlashAttention design that performs softmax entirely in the integer domain and runs as a single Triton kernel. On seven attention workloads from ViT, DeiT, and Swin models, QFlash achieves up to 6.73× speedup over I-ViT and up to 8.69× speedup on Swin, while reducing energy consumption by 18.8% compared to FP16 FlashAttention, without sacrificing Top-1…
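For readers unfamiliar with the online softmax that the abstract refers to, the following is a minimal sketch of the standard floating-point, tile-wise formulation for a single query row. It is not the QFlash kernel and makes no claim about how the paper implements its integer version; the function names `online_softmax_weighted_sum` and `reference_softmax_weighted_sum` and the scalar `values` are illustrative assumptions. The sketch shows the running maximum and the per-tile rescaling factor, which are the steps that conventionally depend on floating-point arithmetic.

```python
import math


def reference_softmax_weighted_sum(scores, values):
    """Plain softmax-weighted sum over the whole row (needs all scores at once)."""
    row_max = max(scores)
    weights = [math.exp(s - row_max) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def online_softmax_weighted_sum(scores, values, tile_size):
    """Same result, computed one tile at a time (floating-point online softmax).

    Illustrative only: a real attention kernel works on vectors of values and
    many query rows in parallel.
    """
    running_max = float("-inf")  # largest score seen so far
    running_sum = 0.0            # softmax denominator, relative to running_max
    running_out = 0.0            # unnormalized output, relative to running_max

    for start in range(0, len(scores), tile_size):
        s_tile = scores[start:start + tile_size]
        v_tile = values[start:start + tile_size]

        new_max = max(running_max, max(s_tile))
        # Rescale the accumulators from the old maximum to the new one.
        # On the first tile this is exp(-inf) == 0.0, which is harmless
        # because both accumulators are still zero.
        correction = math.exp(running_max - new_max)

        tile_weights = [math.exp(s - new_max) for s in s_tile]
        running_sum = running_sum * correction + sum(tile_weights)
        running_out = running_out * correction + sum(
            w * v for w, v in zip(tile_weights, v_tile)
        )
        running_max = new_max

    return running_out / running_sum
```

Calling `online_softmax_weighted_sum(scores, values, tile_size)` with any positive `tile_size` should agree with the reference up to rounding. The `correction` factor is applied once per tile, and the running maximum is obtained by comparing scores across tiles; these are the kinds of operations that the obstacles listed in the abstract concern when moved to the integer domain.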

Code & Models

Repositories

EfficientCompLab/qflash (GitHub)
