Low-Cost FlashAttention with Fused Exponential and Multiplication Hardware Operators
Kosmas Alexandridis, Vasileios Titopoulos, Giorgos Dimitrakopoulos

TL;DR
This paper introduces a hardware optimization for FlashAttention in transformers, fusing exponential and multiplication operations to reduce area and power consumption in ASIC implementations.
Contribution
It proposes new hardware operators that fuse exponential and multiplication computations, improving efficiency of FlashAttention accelerators.
Findings
28.8% reduction in area
17.6% reduction in power
Enhanced efficiency in ASIC implementations
Abstract
Attention mechanisms, particularly within Transformer architectures and large language models (LLMs), have revolutionized sequence modeling in machine learning and artificial intelligence applications. To compute attention for increasingly long sequences, specialized accelerators have been proposed to execute key attention steps directly in hardware. Among the various recently proposed architectures, those based on variants of the FlashAttention algorithm, originally designed for GPUs, stand out due to their optimized computation, tiling capabilities, and reduced memory traffic. In this work, we focus on optimizing the kernel of floating-point-based FlashAttention using new hardware operators that fuse the computation of exponentials and vector multiplications, e.g., e^x V. The proposed ExpMul hardware operators significantly reduce the area and power costs of FlashAttention-based…
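To make the e^x V pattern concrete, the following is a minimal software sketch of the FlashAttention-style online-softmax recurrence for a single query vector, written in plain Python. It is not the paper's hardware design: the helper `exp_mul` is a hypothetical stand-in that only marks where an exponential is immediately multiplied by a vector, and the function names, variable names (`m`, `l`, `o`), and the one-key-at-a-time loop are assumptions chosen for illustration.

```python
import math

def exp_mul(x, vec):
    """Hypothetical software stand-in for a fused e^x * vec operation."""
    e = math.exp(x)
    return [e * v for v in vec]

def naive_attention(q, K, V):
    """Reference: softmax(q . K^T) V for one query, computed in two passes."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    d = len(V[0])
    return [sum(w * v[j] for w, v in zip(weights, V)) / total for j in range(d)]

def flash_attention_row(q, K, V):
    """Online-softmax attention for one query, one key/value pair per step."""
    d = len(V[0])
    m = -math.inf      # running maximum of the scores seen so far
    l = 0.0            # running sum of exponentials
    o = [0.0] * d      # running unnormalized output vector
    for k, v in zip(K, V):
        s = sum(qi * ki for qi, ki in zip(q, k))
        m_new = max(m, s)
        # Both updates below have the form e^x * vector.
        rescaled = exp_mul(m - m_new, o)   # rescale the old accumulator
        contrib = exp_mul(s - m_new, v)    # weight the new value vector
        o = [a + b for a, b in zip(rescaled, contrib)]
        l = l * math.exp(m - m_new) + math.exp(s - m_new)
        m = m_new
    return [x / l for x in o]

q = [0.5, -1.0]
K = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out_flash = flash_attention_row(q, K, V)
out_naive = naive_attention(q, K, V)
```

In this sketch every update of the output accumulator multiplies a vector by an exponential, which is the kind of step the abstract describes as a candidate for fusion. The single-pass result in `out_flash` should agree with the two-pass reference in `out_naive` up to floating-point rounding.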
Taxonomy
Methods
Attention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Dropout · Adam · Multi-Head Attention · Dense Connections · Layer Normalization · Focus
