RIFT: A Scalable Methodology for LLM Accelerator Fault Assessment using Reinforcement Learning
Khurram Khalil, Muhammad Mahad Khaliq, Khaza Anuarul Hoque

TL;DR
RIFT is a reinforcement learning-based framework that efficiently identifies critical fault scenarios in large AI accelerators, significantly reducing testing time and cost while improving fault coverage and enabling better hardware protection strategies.
Contribution
The paper introduces RIFT, a novel reinforcement learning-guided methodology for scalable, efficient fault assessment in large AI accelerators, outperforming traditional methods in speed and coverage.
Findings
Achieves 2.2× faster fault assessment than evolutionary methods.
Reduces test vectors by over 99% compared to random injection.
Provides 12.8× cost-effectiveness improvement with selective error correction.
Abstract
The massive scale of modern AI accelerators presents critical challenges to traditional fault assessment methodologies, which face prohibitive computational costs and provide poor coverage of critical failure modes. This paper introduces RIFT (Reinforcement Learning-guided Intelligent Fault Targeting), a scalable framework that automates the discovery of minimal, high-impact fault scenarios for efficient design-time fault assessment. RIFT transforms the complex search for worst-case faults into a sequential decision-making problem, combining hybrid sensitivity analysis for search space pruning with reinforcement learning to intelligently generate minimal, high-impact test suites. Evaluated on billion-parameter Large Language Model (LLM) workloads using NVIDIA A100 GPUs, RIFT achieves a 2.2× fault assessment speedup over evolutionary methods and reduces the required test…
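The two-stage idea in the abstract (prune the search space by sensitivity, then let a learning agent decide where to inject faults) can be illustrated with a toy sketch. This is not the paper's implementation: the epsilon-greedy bandit is a much simpler stand-in for whatever reinforcement learning algorithm RIFT uses, and every name and number here (`prune_by_sensitivity`, `inject_fault`, `search_critical_sites`, the sensitivity and impact values) is a hypothetical assumption made for illustration.

```python
import random

def prune_by_sensitivity(sensitivity, keep):
    """Return the indices of the `keep` most sensitive fault sites."""
    ranked = sorted(range(len(sensitivity)),
                    key=lambda i: sensitivity[i], reverse=True)
    return ranked[:keep]

def inject_fault(site, true_impact, rng):
    """Hypothetical stand-in for one fault-injection run.

    Returns a noisy measurement of the (made-up) impact of `site`.
    """
    return true_impact[site] + rng.gauss(0.0, 0.05)

def search_critical_sites(candidates, true_impact, steps, epsilon, seed=0):
    """Epsilon-greedy search over the pruned candidate sites."""
    rng = random.Random(seed)
    value = {s: 0.0 for s in candidates}  # running mean impact per site
    count = {s: 0 for s in candidates}    # injections spent per site

    def run(site):
        reward = inject_fault(site, true_impact, rng)
        count[site] += 1
        value[site] += (reward - value[site]) / count[site]

    for site in candidates:   # try every candidate once
        run(site)
    for _ in range(steps):    # then explore or exploit
        if rng.random() < epsilon:
            run(rng.choice(candidates))
        else:
            run(max(candidates, key=lambda s: value[s]))
    return value, count

# Made-up toy data: six fault sites with a cheap sensitivity proxy
# and a hidden "true" impact that only injection runs reveal.
sensitivity = [0.1, 0.8, 0.3, 0.9, 0.05, 0.7]
true_impact = [0.05, 0.5, 0.1, 0.9, 0.0, 0.3]

candidates = prune_by_sensitivity(sensitivity, keep=3)
value, count = search_critical_sites(candidates, true_impact,
                                     steps=50, epsilon=0.1)
best = max(candidates, key=lambda s: value[s])
print("most critical site:", best, "injections per site:", count)
```

In this toy setting the pruning step discards half of the sites before any injection is run, and the greedy choice concentrates most of the remaining budget on the site with the highest estimated impact, which is the intuition behind generating a small, high-impact test suite.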
Taxonomy
Topics: Radiation Effects in Electronics · Software Testing and Debugging Techniques · VLSI and Analog Circuit Testing
