RIFT: A Scalable Methodology for LLM Accelerator Fault Assessment using Reinforcement Learning

Khurram Khalil; Muhammad Mahad Khaliq; Khaza Anuarul Hoque

arXiv:2512.09829 · cs.AI · December 11, 2025

Open Access

TL;DR

RIFT is a reinforcement learning-based framework that efficiently identifies critical fault scenarios in large AI accelerators, significantly reducing testing time and cost while improving fault coverage and enabling better hardware protection strategies.

Contribution

The paper introduces RIFT, a reinforcement learning-guided methodology for scalable fault assessment in large AI accelerators. It reports a 2.2× speedup over evolutionary methods, over 99% fewer test vectors than random injection, and improved fault coverage.

Findings

01. Achieves 2.2× faster fault assessment than evolutionary methods.

02. Reduces test vectors by over 99% compared to random injection.

03. Provides 12.8× cost-effectiveness improvement with selective error correction.

Abstract

The massive scale of modern AI accelerators presents critical challenges to traditional fault assessment methodologies, which face prohibitive computational costs and provide poor coverage of critical failure modes. This paper introduces RIFT (Reinforcement Learning-guided Intelligent Fault Targeting), a scalable framework that automates the discovery of minimal, high-impact fault scenarios for efficient design-time fault assessment. RIFT transforms the complex search for worst-case faults into a sequential decision-making problem, combining hybrid sensitivity analysis for search space pruning with reinforcement learning to intelligently generate minimal, high-impact test suites. Evaluated on billion-parameter Large Language Model (LLM) workloads using NVIDIA A100 GPUs, RIFT achieves a 2.2× fault assessment speedup over evolutionary methods and reduces the required test…
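
The two-stage idea in the abstract (prune the fault space with sensitivity scores, then let a learning agent decide where to inject next) can be illustrated with a toy sketch. This is not the authors' implementation: the RL agent is replaced here by a simple epsilon-greedy bandit, and the names `prune_by_sensitivity`, `inject`, and `epsilon_greedy_search`, the sensitivity scores, and the impact values are all hypothetical stand-ins for a real fault-injection campaign.

```python
import random

def prune_by_sensitivity(scores, keep):
    """Return indices of the `keep` highest-scoring fault sites (search-space pruning)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:keep]

def inject(site, true_impact, rng):
    """Hypothetical fault-injection oracle: a noisy observed accuracy drop for one site."""
    return max(0.0, true_impact[site] + rng.gauss(0.0, 0.01))

def epsilon_greedy_search(candidates, true_impact, episodes=500, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit over the pruned sites (a simplified stand-in for an RL agent)."""
    rng = random.Random(seed)
    value = {}
    count = {}
    # Try every candidate once so each site has an initial impact estimate.
    for site in candidates:
        value[site] = inject(site, true_impact, rng)
        count[site] = 1
    for _ in range(episodes):
        if rng.random() < epsilon:
            site = rng.choice(candidates)                     # explore
        else:
            site = max(candidates, key=lambda s: value[s])    # exploit
        reward = inject(site, true_impact, rng)
        count[site] += 1
        # Incremental running mean of the observed impact for this site.
        value[site] += (reward - value[site]) / count[site]
    return value, count

# Made-up example data: 8 fault sites, of which only a few matter.
true_impact = [0.01, 0.02, 0.50, 0.03, 0.20, 0.01, 0.35, 0.02]
sensitivity = [0.10, 0.20, 0.90, 0.10, 0.60, 0.05, 0.80, 0.30]

candidates = prune_by_sensitivity(sensitivity, keep=3)
value, count = epsilon_greedy_search(candidates, true_impact)
best = max(value, key=value.get)
print(candidates, best)
```

In this sketch, pruning shrinks the search from eight sites to three before any injection happens, and the bandit then spends most of its injection budget on the site with the largest estimated impact. A full sequential formulation would also carry state between injections, for example the set of faults already placed, which the bandit deliberately omits.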

Taxonomy

Topics: Radiation Effects in Electronics · Software Testing and Debugging Techniques · VLSI and Analog Circuit Testing