eFAT: Improving the Effectiveness of Fault-Aware Training for Mitigating Permanent Faults in DNN Hardware Accelerators
Muhammad Abdullah Hanif, Muhammad Shafique

TL;DR
eFAT is a framework that lowers the cost of fault-aware training: it assesses a DNN's resilience to permanent faults and uses that assessment to retrain once for a group of faulty chips rather than once per chip, reducing retraining overheads for fault-prone DNN hardware accelerators.
Contribution
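The paper introduces eFAT, a framework that computes the resilience of a DNN to faults and uses it to (1) select the amount of retraining and (2) group and fuse the fault maps of multiple faulty chips so that one consolidated retraining run serves the whole group.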
Findings
Reduces retraining overheads for fault mitigation in DNN accelerators.
Effectively groups fault maps to enable consolidated retraining.
Keeps accuracy within the specified constraints while reducing the cost of fault-aware training.
Abstract
Fault-Aware Training (FAT) has emerged as a highly effective technique for addressing permanent faults in DNN accelerators, as it offers fault mitigation without significant performance or accuracy loss, specifically at low and moderate fault rates. However, it leads to very high retraining overheads, especially when used for large DNNs designed for complex AI applications. Moreover, as each fabricated chip can have a distinct fault pattern, FAT is required to be performed for each faulty chip individually, considering its unique fault map, which further aggravates the problem. To reduce the overheads of FAT while maintaining its benefits, we propose (1) the concepts of resilience-driven retraining amount selection, and (2) resilience-driven grouping and fusion of multiple fault maps (belonging to different chips) to perform consolidated retraining for a group of faulty chips. To…
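The grouping-and-fusion idea in the abstract can be illustrated with a small sketch. It is not the authors' algorithm, which the truncated abstract does not describe. It assumes that a fault map is a set of faulty processing-element coordinates, that fusion is a plain set union, that a cap on the fused fault rate (`max_fused_rate`, a hypothetical stand-in for a resilience-derived limit) decides whether chips can share a group, and that weights mapped to faulty elements are zeroed during retraining. All names here are hypothetical.

```python
from typing import FrozenSet, List, Set, Tuple

PE = Tuple[int, int]      # (row, col) of a processing element; assumed encoding
FaultMap = FrozenSet[PE]  # permanently faulty PEs of one chip


def fuse(maps: List[FaultMap]) -> FaultMap:
    """Fuse fault maps by union: a PE counts as faulty if it is faulty on any chip."""
    fused: Set[PE] = set()
    for m in maps:
        fused |= m
    return frozenset(fused)


def fault_rate(fault_map: FaultMap, rows: int, cols: int) -> float:
    """Fraction of PEs in a rows x cols array that are faulty."""
    return len(fault_map) / (rows * cols)


def group_fault_maps(
    maps: List[FaultMap], rows: int, cols: int, max_fused_rate: float
) -> List[List[FaultMap]]:
    """Greedy first-fit grouping (an assumption, not the paper's method).

    A chip joins the first group whose fused fault rate stays within
    max_fused_rate; otherwise it starts a new group.
    """
    groups: List[List[FaultMap]] = []
    for m in sorted(maps, key=len, reverse=True):
        for g in groups:
            if fault_rate(fuse(g + [m]), rows, cols) <= max_fused_rate:
                g.append(m)
                break
        else:
            groups.append([m])
    return groups


def apply_fault_mask(
    weights: List[List[float]], fault_map: FaultMap
) -> List[List[float]]:
    """Zero every weight mapped to a faulty PE (assumes one weight per PE)."""
    return [
        [0.0 if (r, c) in fault_map else w for c, w in enumerate(row)]
        for r, row in enumerate(weights)
    ]
```

Under these assumptions, each group is retrained once with the mask of its fused map, and the resulting model is deployed on every chip in the group, since each chip's faults are a subset of the fused faults. The trade-off is that a fused map has a fault rate at least as high as any single chip's, which is why grouping needs some bound tied to the network's resilience.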
Taxonomy
Topics: Radiation Effects in Electronics · VLSI and Analog Circuit Testing · Semiconductor materials and devices
