FLASH-D: FlashAttention with Hidden Softmax Division

Kosmas Alexandridis; Vasileios Titopoulos; Giorgos Dimitrakopoulos

arXiv:2505.14201 · cs.LG · May 21, 2025

TL;DR

FLASH-D is a simplified, hardware-efficient formulation of FlashAttention that reduces computational cost, area, and power consumption while preserving FlashAttention's core properties and numerical stability.

Contribution

The paper presents a mathematically equivalent reformulation of FlashAttention that simplifies implementation and improves hardware efficiency without sacrificing accuracy or performance.

Findings

01. Achieves 22.8% reduction in hardware area

02. Achieves 20.3% reduction in power consumption

03. Maintains numerical stability and core properties of FlashAttention

Abstract

The transformer's attention mechanism has revolutionized AI and machine learning, and computing it efficiently is crucial to model performance. However, calculating attention involves matrix operations interspersed with softmax rescaling, which inherently slows down computation and requires processing the entire input sequence. Building on online softmax computation, FlashAttention integrates softmax calculation with matrix arithmetic, enabling tiled computation independent of sequence length. While optimized for GPUs, FlashAttention's simplicity makes it amenable to direct hardware acceleration. This work re-evaluates the core FlashAttention kernel, presenting FLASH-D, a mathematically equivalent yet simplified formulation that achieves: (a) hiding softmax division within other non-linear function evaluations; (b) inherently numerically stable computation of exponentials,…
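To make the abstract's terms concrete, here is a minimal Python sketch for a single query vector. It compares plain softmax attention, the online-softmax recurrence that FlashAttention builds on, and one algebraically equivalent way to fold the final division into a sigmoid. All function names are hypothetical. The sigmoid recurrence is derived here for illustration only: the excerpt above does not spell out the paper's exact formulation, so it should be read as an assumption and not as the authors' kernel.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_reference(q, keys, values):
    """Plain softmax attention: all scores are needed before normalising."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e * v[j] for e, v in zip(exps, values)) / total
            for j in range(dim)]


def attention_online(q, keys, values):
    """Online softmax: one pass with a running max, running sum, and
    unnormalised output, followed by a single division at the end."""
    m = -math.inf
    l = 0.0
    o = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        s = dot(q, k)
        m_new = max(m, s)
        scale = math.exp(m - m_new)   # rescales earlier terms; 0.0 on step 1
        p = math.exp(s - m_new)
        l = l * scale + p
        o = [oj * scale + p * vj for oj, vj in zip(o, v)]
        m = m_new
    return [oj / l for oj in o]


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def attention_sigmoid_weights(q, keys, values):
    """Illustrative variant (an assumption, not taken from the excerpt):
    with w_i = exp(s_i) / l_i, one gets
        w_i = sigmoid(s_i - s_{i-1} + ln w_{i-1}),  w_1 = 1,
        o_i = o_{i-1} + w_i * (v_i - o_{i-1}),
    so the output stays normalised and no explicit division appears."""
    o = None
    s_prev = 0.0
    log_w = 0.0
    for k, v in zip(keys, values):
        s = dot(q, k)
        if o is None:
            o = list(v)               # first step: w_1 = 1, ln w_1 = 0
        else:
            log_w = log_sigmoid(s - s_prev + log_w)
            w = math.exp(log_w)
            o = [oj + w * (vj - oj) for oj, vj in zip(o, v)]
        s_prev = s
    return o
```

In this sketch only differences of scores are exponentiated in the third variant, which is why it needs no running maximum; whether this matches the stability argument made in the paper is not confirmed by the excerpt.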

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Low-power high-performance VLSI design · Numerical Methods and Algorithms

Methods: Attention Is All You Need · Softmax