Low-Cost FlashAttention with Fused Exponential and Multiplication Hardware Operators

Kosmas Alexandridis; Vasileios Titopoulos; Giorgos Dimitrakopoulos

arXiv:2505.14314 · cs.AR · June 2, 2025

TL;DR

This paper introduces a hardware optimization for FlashAttention in transformers, fusing exponential and multiplication operations to reduce area and power consumption in ASIC implementations.

Contribution

It proposes new hardware operators, called ExpMul, that fuse the exponential and vector-multiplication computations in the floating-point FlashAttention kernel, reducing the area and power cost of FlashAttention-based accelerators.

Findings

01 28.8% reduction in area

02 17.6% reduction in power

03 Enhanced efficiency in ASIC implementations

Abstract

Attention mechanisms, particularly within Transformer architectures and large language models (LLMs), have revolutionized sequence modeling in machine learning and artificial intelligence applications. To compute attention for increasingly long sequences, specialized accelerators have been proposed to execute key attention steps directly in hardware. Among the various recently proposed architectures, those based on variants of the FlashAttention algorithm, originally designed for GPUs, stand out due to their optimized computation, tiling capabilities, and reduced memory traffic. In this work, we focus on optimizing the kernel of floating-point-based FlashAttention using new hardware operators that fuse the computation of exponentials and vector multiplications, e.g., e^x V. The proposed ExpMul hardware operators significantly reduce the area and power costs of FlashAttention-based…
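To make the e^x V pattern from the abstract concrete, here is a minimal software sketch of a FlashAttention-style online softmax for a single query vector. It is not the paper's hardware design. It only shows where products of an exponential and a vector appear in the kernel, which is the step a fused operator would target. The helper name `exp_mul`, the function names, and the omission of the usual 1/sqrt(d) score scaling are assumptions made for illustration.

```python
import math


def exp_mul(x, vec):
    """Hypothetical software stand-in for a fused e^x * V operator."""
    scale = math.exp(x)
    return [scale * v for v in vec]


def flash_attention_row(q, keys, values):
    """Attention output for one query, computed with an online softmax."""
    m = -math.inf                 # running maximum of the scores
    l = 0.0                       # running sum of exponentials
    o = [0.0] * len(values[0])    # running unnormalized output vector
    for k, v in zip(keys, values):
        s = sum(qi * ki for qi, ki in zip(q, k))   # dot-product score
        m_new = max(m, s)
        # Rescale the running sum to the new maximum and add the new term.
        l = l * math.exp(m - m_new) + math.exp(s - m_new)
        # Two exponential-times-vector products per step:
        o_rescaled = exp_mul(m - m_new, o)   # correct the old accumulator
        v_weighted = exp_mul(s - m_new, v)   # weight the incoming value row
        o = [a + b for a, b in zip(o_rescaled, v_weighted)]
        m = m_new
    # Single division at the end, as in the lazy-normalization variant.
    return [x / l for x in o]


def reference_attention_row(q, keys, values):
    """Plain softmax attention for one query, used only as a check."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    mx = max(scores)
    weights = [math.exp(s - mx) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]
```

Calling `flash_attention_row(q, keys, values)` and `reference_attention_row(q, keys, values)` on the same inputs should agree up to floating-point rounding. In this sketch each loop iteration calls `exp_mul` twice, which is why computing the exponential and the multiplication as separate steps is a natural target for fusion.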

Taxonomy

Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Dropout · Adam · Multi-Head Attention · Dense Connections · Layer Normalization · Focus