ALBERTA: ALgorithm-Based Error Resilience in Transformer Architectures
Haoxuan Liu, Vasu Singh, Michał Filipiuk, and Siva Kumar Sastry Hari

TL;DR
ALBERTA is a framework that improves the reliability of transformer architectures in safety-critical applications by detecting and correcting errors with high coverage and minimal overhead.
Contribution
The paper introduces an end-to-end resilience analysis and protection method for transformers: it ranks layers by vulnerability and protects the most vulnerable general matrix multiply (GEMM) layers with checksum-based error detection and self-correction.
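To make the idea of checksum-based detection and self-correction concrete, here is a minimal sketch of the classic algorithm-based fault tolerance scheme for a matrix product C = A·B. It is an illustration only, not ALBERTA's implementation: the function names `matmul` and `check_and_correct`, the tolerance `tol`, and the single-corrupted-element assumption are all hypothetical choices made for this example. The reference row and column sums of C are derived from A and B with two matrix-vector products, then compared against the sums of the computed output.

```python
def matmul(a, b):
    """Plain triple-loop GEMM on lists of lists."""
    inner = len(b)
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(len(b[0]))] for i in range(len(a))]


def check_and_correct(a, b, c, tol=1e-9):
    """Check c against a @ b using row and column checksums.

    Returns None if the checksums agree, or (i, j) after repairing a single
    corrupted element in place. Raises ValueError for any other mismatch
    pattern, which this sketch can detect but not correct.
    """
    n, m, inner = len(c), len(c[0]), len(b)

    # Reference row sums of C: A @ (B @ 1), without recomputing the full GEMM.
    b_row_sums = [sum(row) for row in b]
    ref_row = [sum(a[i][k] * b_row_sums[k] for k in range(inner))
               for i in range(n)]

    # Reference column sums of C: (1^T @ A) @ B.
    a_col_sums = [sum(a[i][k] for i in range(n)) for k in range(inner)]
    ref_col = [sum(a_col_sums[k] * b[k][j] for k in range(inner))
               for j in range(m)]

    # Difference between observed and reference checksums.
    row_diff = [sum(c[i]) - ref_row[i] for i in range(n)]
    col_diff = [sum(c[i][j] for i in range(n)) - ref_col[j] for j in range(m)]

    bad_rows = [i for i, d in enumerate(row_diff) if abs(d) > tol]
    bad_cols = [j for j, d in enumerate(col_diff) if abs(d) > tol]

    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        # The row discrepancy equals the error injected into c[i][j].
        c[i][j] -= row_diff[i]
        return (i, j)
    raise ValueError("mismatch is not a single corrupted element")
```

Calling `check_and_correct(a, b, matmul(a, b))` on an uncorrupted product returns `None`. If one element of the output is perturbed beforehand, the call returns that element's index and restores its value. With floating-point data the tolerance has to absorb rounding differences between the two summation orders, which is why `tol` is exposed as a parameter.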
Findings
Achieves over 99% error coverage with minimal overhead.
Effectively protects GEMM layers in transformer models.
Applicable across various GPU architectures and precisions.
Abstract
Vision Transformers are being increasingly deployed in safety-critical applications that demand high reliability. It is crucial to ensure the correctness of their execution in spite of potential errors such as transient hardware errors. We propose a novel algorithm-based resilience framework called ALBERTA that allows us to perform end-to-end resilience analysis and protection of transformer-based architectures. First, our work develops an efficient process of computing and ranking the resilience of transformer layers. We find that due to the large size of transformer models, applying traditional network redundancy to a subset of the most vulnerable layers provides high error coverage albeit with impractically high overhead. We address this shortcoming by providing a software-directed, checksum-based error detection technique aimed at protecting the most vulnerable general matrix…
Taxonomy
Topics: Radiation Effects in Electronics · Advanced Memory and Neural Computing · Advanced Neural Network Applications
