Thales: Formulating and Estimating Architectural Vulnerability Factors for DNN Accelerators
Abhishek Tyagi, Yiming Gan, Shaoshan Liu, Bo Yu, Paul Whatmough, Yuhao Zhu

TL;DR
This paper introduces a novel method to accurately estimate the impact of transient hardware faults on DNN accuracy, revealing greater vulnerability than previous metrics suggested, and aiding in designing more resilient networks.
Contribution
It presents a new algorithm for precisely estimating Resiliency Accuracy (RA) under transient faults, reformulating the estimation problem as a Monte Carlo integration solved with importance sampling. The approach is hardware-validated.
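The idea of treating Resiliency Accuracy (RA) as an expectation and estimating it with importance sampling can be illustrated with a minimal Python sketch. This is a toy under stated assumptions, not the paper's algorithm: the function names, the four "software variables", their fault probabilities, and the per-variable accuracies are all hypothetical, and the accuracy lookup table stands in for what would really be an inference run with a fault injected.

```python
import random


def exact_ra(fault_prob, acc_given_fault):
    """Expected accuracy when variable i is the faulty one with probability fault_prob[i]."""
    return sum(p * a for p, a in zip(fault_prob, acc_given_fault))


def uniform_ra(acc_given_fault):
    """Estimate that (incorrectly) treats every variable as equally likely to be faulty."""
    return sum(acc_given_fault) / len(acc_given_fault)


def importance_sampled_ra(fault_prob, acc_given_fault, proposal, n_samples, seed=0):
    """Monte Carlo estimate of exact_ra using samples drawn from `proposal`.

    Each sample is reweighted by fault_prob[i] / proposal[i], so the estimator
    stays unbiased even though faults are drawn from the wrong distribution.
    """
    rng = random.Random(seed)
    indices = list(range(len(fault_prob)))
    total = 0.0
    for i in rng.choices(indices, weights=proposal, k=n_samples):
        weight = fault_prob[i] / proposal[i]
        # Hypothetical stand-in: a real study would run fault-injected inference here.
        total += weight * acc_given_fault[i]
    return total / n_samples


if __name__ == "__main__":
    # Made-up numbers: variable 0 is both the most exposed and the most damaging.
    fault_prob = [0.7, 0.1, 0.1, 0.1]
    acc_given_fault = [0.2, 0.9, 0.9, 0.9]
    proposal = [0.25, 0.25, 0.25, 0.25]

    print("uniform assumption :", uniform_ra(acc_given_fault))
    print("exact expectation  :", exact_ra(fault_prob, acc_given_fault))
    print("importance sampling:",
          importance_sampled_ra(fault_prob, acc_given_fault, proposal, 20000))
```

With these made-up numbers the uniform assumption reports a higher accuracy than the probability-weighted expectation, which mirrors the direction of the paper's finding. The reweighted estimate converges to the weighted value even though it samples faults uniformly.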
Findings
Transient faults cause more accuracy degradation than existing tools estimate.
The proposed RA estimation method is lightweight and hardware-validated.
RA estimation can guide the design of more resilient DNN architectures.
Abstract
As Deep Neural Networks (DNNs) are increasingly deployed in safety-critical and privacy-sensitive applications such as autonomous driving and biometric authentication, it is critical to understand the fault-tolerance nature of DNNs. Prior work primarily focuses on metrics such as the Failures In Time (FIT) rate and the Silent Data Corruption (SDC) rate, which quantify how often a device fails. Instead, this paper focuses on quantifying the DNN accuracy given that a transient error has occurred, which tells us how well a network behaves when a transient error occurs. We call this metric Resiliency Accuracy (RA). We show that the existing RA formulation is fundamentally inaccurate, because it incorrectly assumes that software variables (model weights/activations) have equal faulty probability under hardware transient faults. We present an algorithm that captures the faulty probabilities of DNN…
Taxonomy
Topics: Radiation Effects in Electronics · Software Reliability and Analysis Research · Reliability and Maintenance Optimization
