Custom Algorithm-based Fault Tolerance for Attention Layers in Transformers

Vasileios Titopoulos; Kosmas Alexandridis; Giorgos Dimitrakopoulos

arXiv:2507.16676·cs.LG·July 23, 2025

TL;DR

This paper introduces Flash-ABFT, a fault-tolerance method for attention layers in transformers that detects errors caused by random hardware faults at low overhead, improving the reliability of AI accelerators.

Contribution

It presents a checksum-based fault-detection technique that covers an entire attention layer, softmax included, with a single check, reducing overhead compared to traditional ABFT methods that verify each matrix multiplication individually.

Findings

01. Only 5.3% hardware area overhead
02. Less than 1.9% energy overhead
03. High fault-detection accuracy

Abstract

Transformers and large language models (LLMs), powered by the attention mechanism, have transformed numerous AI applications, driving the need for specialized hardware accelerators. A major challenge in these accelerators is efficiently detecting errors caused by random hardware faults. Traditional algorithm-based fault tolerance (ABFT) techniques verify individual matrix multiplications but fall short in handling the full attention mechanism, particularly because of the intermediate softmax normalization. This work proposes Flash-ABFT, a novel method that computes an online checksum across the entire three-matrix product of the query, key and value matrices of an attention layer, including the softmax operation, and verifies it with a single check. This approach significantly reduces overhead by eliminating redundant checks while maintaining high fault-detection accuracy. Experimental results demonstrate that…
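The single-check idea in the abstract can be illustrated with a small software sketch. It is not the paper's hardware design; it only relies on a standard identity: the sum of all entries of the attention output equals the softmax weights applied to the row sums of the value matrix, so a predicted checksum can be accumulated online (with a running maximum, as in flash-style attention) without forming the output. The function names, the tolerance, and the way the fault is injected are all assumptions made for illustration.

```python
import math
import random

def attention(Q, K, V):
    """Reference softmax attention: O = softmax(Q K^T / sqrt(d)) V, row by row."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)  # subtract the row maximum for numerical stability
        w = [math.exp(s - m) for s in scores]
        denom = sum(w)
        out.append([sum(w[j] * V[j][c] for j in range(len(V))) / denom
                    for c in range(len(V[0]))])
    return out

def predicted_checksum(Q, K, V):
    """Predict the sum of all entries of O in one online pass, without forming O."""
    d = len(Q[0])
    v_rowsum = [sum(row) for row in V]  # checksum column of V
    total = 0.0
    for q in Q:
        m, denom, acc = -math.inf, 0.0, 0.0
        for k, vs in zip(K, v_rowsum):
            s = sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            m_new = max(m, s)
            rescale = math.exp(m - m_new)  # 0.0 on the first step (m is -inf)
            p = math.exp(s - m_new)
            denom = denom * rescale + p    # running softmax denominator
            acc = acc * rescale + p * vs   # running weighted sum of V row sums
            m = m_new
        total += acc / denom
    return total

def output_checksum(O):
    """Actual checksum: sum of every entry of the attention output."""
    return sum(sum(row) for row in O)

def passes_check(Q, K, V, O, tol=1e-9):
    """Single comparison for the whole layer; tol is an assumed threshold."""
    return math.isclose(output_checksum(O), predicted_checksum(Q, K, V),
                        rel_tol=tol, abs_tol=tol)

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
n, d = 8, 4
Q, K, V = rand_matrix(n, d), rand_matrix(n, d), rand_matrix(n, d)

O = attention(Q, K, V)
clean_ok = passes_check(Q, K, V, O)

# Emulate a hardware fault by corrupting one output element (hypothetical fault model).
O_faulty = [row[:] for row in O]
O_faulty[3][1] += 0.5
faulty_ok = passes_check(Q, K, V, O_faulty)
```

In this sketch the fault-free output passes the check and the corrupted output fails it. A real accelerator works in finite-precision arithmetic, so the comparison threshold would need to account for rounding differences between the output path and the checksum path.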

Taxonomy

Topics: Radiation Effects in Electronics · Advanced Memory and Neural Computing · VLSI and Analog Circuit Testing