H-FA: A Hybrid Floating-Point and Logarithmic Approach to Hardware Accelerated FlashAttention
Kosmas Alexandridis, Giorgos Dimitrakopoulos

TL;DR
H-FA computes FlashAttention in hardware using a mix of floating-point and fixed-point logarithmic arithmetic, with a reported 26.5% reduction in area and 23.4% reduction in power at performance comparable to existing architectures.
Contribution
The paper proposes a hybrid computation method that combines floating-point and fixed-point logarithmic representations for an efficient hardware implementation of FlashAttention.
Findings
Achieves 26.5% area reduction in hardware
Reduces power consumption by 23.4%
Maintains performance comparable to existing architectures
Abstract
Transformers have significantly advanced AI and machine learning through their powerful attention mechanism. However, computing attention on long sequences can become a computational bottleneck. FlashAttention mitigates this by fusing the softmax and matrix operations into a tiled computation pattern that decouples performance from sequence length. Though designed for GPUs, its simplicity also makes it well suited for direct hardware acceleration. To improve hardware implementation, we compute FlashAttention using a mixture of floating-point and fixed-point logarithm domain representations. Floating-point is used to compute attention scores from query and key matrices, while logarithmic computation simplifies the fused computation of softmax normalization and the multiplication with the value matrix. This transformation, called H-FA, replaces vector-wide floating-point multiplication…
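The fused computation described in the abstract can be illustrated with a minimal Python sketch of the standard online-softmax recurrence that FlashAttention is built on. This is not the H-FA datapath: it stays entirely in floating point, processes one key/value pair per step (a tile size of 1), and the function names `naive_attention` and `flash_attention_row` are hypothetical, chosen only for this illustration.

```python
import math

def naive_attention(q, keys, values):
    """Reference: softmax(q . K^T) V for a single query vector."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)                       # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def flash_attention_row(q, keys, values):
    """Single-pass attention for one query using the online-softmax recurrence."""
    m = -math.inf                         # running maximum of the scores
    l = 0.0                               # running sum of exponentials
    o = [0.0] * len(values[0])            # unnormalized output accumulator
    for k, v in zip(keys, values):
        s = sum(qi * ki for qi, ki in zip(q, k))   # attention score q . k
        m_new = max(m, s)
        scale = math.exp(m - m_new)       # rescales old state; 0.0 on first step
        p = math.exp(s - m_new)           # weight of the current key
        l = l * scale + p
        o = [oi * scale + p * vi for oi, vi in zip(o, v)]
        m = m_new
    return [oi / l for oi in o]           # single division at the end

q = [0.5, -1.0]
keys = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
reference = naive_attention(q, keys, values)
streamed = flash_attention_row(q, keys, values)
```

Both functions should agree up to floating-point rounding. The multiplications by `scale` and `p` and the final division are the kind of operation that a logarithmic representation turns into additions and subtractions, since log(ab) = log a + log b. The sketch does not model the fixed-point logarithmic arithmetic itself.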
Taxonomy
Topics: Numerical Methods and Algorithms · Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies
