V-ABFT: Variance-Based Adaptive Threshold for Fault-Tolerant Matrix Multiplication in Mixed-Precision Deep Learning
Yiheng Gao, Qin Hua, and Zizhong Chen

TL;DR
V-ABFT introduces a variance-based adaptive threshold method for fault detection in matrix multiplication, significantly improving accuracy and efficiency over existing approaches across various precisions and hardware platforms.
Contribution
The paper presents V-ABFT, a novel variance-based adaptive threshold algorithm that offers tighter error bounds and lower false positives compared to prior probabilistic methods.
Findings
Reduces the threshold-to-actual-error ratio by 6-48× relative to A-ABFT across precisions.
Enables ~1000× finer detection granularity in low-precision GEMM.
Validated on synthetic data and on real models including LLaMA-7B, GPT-2, and ViT.
Abstract
Algorithm-Based Fault Tolerance (ABFT) is widely adopted to detect silent data corruptions (SDCs) in matrix multiplication, a cornerstone operation in deep learning systems. However, existing threshold determination methods face critical challenges: analytical bounds are overly conservative, while probabilistic approaches like A-ABFT yield thresholds -- larger than actual rounding errors. We present V-ABFT, a variance-based adaptive threshold algorithm that achieves tighter error bounds by directly modeling the verification difference. By leveraging statistical variance estimation, V-ABFT reduces the threshold-to-actual-error ratio to approximately -- for FP32/FP64 and -- for BF16, representing a 6-48× improvement over A-ABFT while maintaining zero false positive rate across BF16, FP16, FP32, and FP64 precisions. Furthermore,…
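The checksum verification that ABFT relies on can be sketched as follows. This is a minimal illustration of the generic column-checksum test for C = A·B, not the V-ABFT algorithm itself: the helper names `matmul` and `abft_column_check` are hypothetical, and the fixed `threshold` argument is a placeholder assumption standing in for whatever bound a real scheme (analytical, A-ABFT, or V-ABFT's variance-based estimate) would compute. The quantity compared against the threshold is the verification difference the abstract refers to.

```python
import random

def matmul(A, B):
    """Plain triple-loop matrix product (stand-in for a GEMM kernel)."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def abft_column_check(A, B, C, threshold):
    """Return the column indices of C whose checksum mismatch exceeds threshold.

    Uses the identity e^T (A B) = (e^T A) B, where e is the all-ones vector.
    `threshold` is a caller-supplied placeholder, not V-ABFT's bound.
    """
    k, m = len(B), len(B[0])
    # Checksum row of A: column sums, i.e. e^T A.
    a_sum = [sum(row[p] for row in A) for p in range(k)]
    flagged = []
    for j in range(m):
        expected = sum(a_sum[p] * B[p][j] for p in range(k))  # (e^T A) B
        observed = sum(row[j] for row in C)                   # e^T C
        # Verification difference: rounding error alone keeps this tiny.
        if abs(observed - expected) > threshold:
            flagged.append(j)
    return flagged

random.seed(0)
A = [[random.uniform(-1, 1) for _ in range(8)] for _ in range(6)]
B = [[random.uniform(-1, 1) for _ in range(5)] for _ in range(8)]
C = matmul(A, B)

clean = abft_column_check(A, B, C, threshold=1e-9)
C[2][3] += 0.5  # inject a simulated silent data corruption
faulty = abft_column_check(A, B, C, threshold=1e-9)
```

In this sketch the choice of `threshold` decides everything: too large and small corruptions slip through, too small and ordinary rounding error triggers false positives. That trade-off is the problem the adaptive threshold in the abstract targets.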
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Stochastic Gradient Optimization Techniques · Radiation Effects in Electronics
