An ECC-based Fault Tolerance Approach for DNNs
Mohsen Raji, Mohammad Zaree, Kimia Soroush

TL;DR
This paper introduces SPW, an ECC-based fault tolerance method for DNNs that detects and corrects bit-flip errors, significantly improving accuracy under fault conditions with manageable area overhead.
Contribution
The paper proposes a novel ECC-based fault tolerance approach for DNNs, enhancing robustness against bit-flip errors with a simple correction and masking strategy.
Findings
Accuracy increases by over 300% at high bit error rates.
Fault detection and correction effectively maintain DNN functionality.
Area overhead is limited to 47.5%.
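To make "bit error rate" concrete, here is a minimal sketch of how bit flips could be injected into stored weights. It assumes each stored bit flips independently with probability `ber` and that weights are 8-bit integers; neither assumption comes from the paper. The function name `inject_bit_flips` and its parameters are hypothetical.

```python
import random

def inject_bit_flips(weights, ber, bits=8, rng=None):
    """Flip each stored bit independently with probability `ber`.

    Assumption (not from the paper): weights are unsigned `bits`-wide
    integers and faults are independent, uniformly distributed bit flips.
    Returns the faulty weights and the number of flipped bits.
    """
    rng = rng or random.Random()
    faulty = []
    flips = 0
    for w in weights:
        for b in range(bits):
            if rng.random() < ber:
                w ^= 1 << b  # flip bit b of this weight
                flips += 1
        faulty.append(w)
    return faulty, flips

# Example: inject faults at a bit error rate of 1e-2 with a fixed seed.
faulty, n_flips = inject_bit_flips([0x5A] * 1000, ber=1e-2, rng=random.Random(0))
print(n_flips, "bits flipped out of", 1000 * 8)
```

A statistical campaign would repeat this over many seeds and measure model accuracy each time. The campaign described in the paper may differ in its fault model and sampling.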
Abstract
Deep Neural Networks (DNNs) have achieved great success in solving a wide range of machine learning problems. Recently, they have been deployed in datacenters (potentially for business-critical or industrial applications) and in safety-critical systems such as self-driving cars. Their correct functionality in the presence of potential bit-flip errors on DNN parameters stored in memory therefore plays a key role in their applicability to safety-critical applications. In this paper, a fault tolerance approach based on Error Correcting Codes (ECC), called SPW, is proposed to ensure the correct functionality of DNNs in the presence of bit-flip faults. In the proposed approach, error occurrence is detected by the stored ECC; the weight is then corrected in the case of a single-bit error, or otherwise set entirely to zero (i.e., masked). A statistical fault injection campaign is proposed and…
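The detect, correct, or mask policy in the abstract can be sketched with a standard SEC-DED (single-error-correct, double-error-detect) extended Hamming code. This is only an illustration under stated assumptions: the abstract does not say which ECC SPW uses or how wide the weights are, so the 8-bit weight, the 13-bit codeword layout, and the names `encode` and `decode` are all hypothetical.

```python
# Assumed layout (not from the paper): bit 0 is an overall parity bit,
# bits 1, 2, 4, 8 are Hamming parity bits, the remaining bits hold data.
DATA_POS = [3, 5, 6, 7, 9, 10, 11, 12]
PARITY_POS = [1, 2, 4, 8]

def _syndrome(code):
    """XOR of the positions (1..12) whose bits are set; 0 for a valid word."""
    s = 0
    for pos in range(1, 13):
        if (code >> pos) & 1:
            s ^= pos
    return s

def encode(weight):
    """Encode an 8-bit weight into a 13-bit SEC-DED codeword."""
    assert 0 <= weight < 256
    code = 0
    for i, pos in enumerate(DATA_POS):
        if (weight >> i) & 1:
            code |= 1 << pos
    s = _syndrome(code)
    for p in PARITY_POS:
        if s & p:
            code |= 1 << p  # setting parity bit p cancels that syndrome bit
    if bin(code).count("1") % 2:
        code |= 1  # overall parity bit makes the total number of ones even
    return code

def decode(code):
    """Return (weight, status): corrected on a single-bit error, zero otherwise."""
    s = _syndrome(code)
    odd = bin(code).count("1") % 2 == 1
    if s == 0 and not odd:
        status = "clean"
    elif odd and s <= 12:
        code ^= 1 << s  # single-bit error at position s (0 = overall parity bit)
        status = "corrected"
    else:
        return 0, "masked"  # uncorrectable: mask the weight to zero
    weight = 0
    for i, pos in enumerate(DATA_POS):
        if (code >> pos) & 1:
            weight |= 1 << i
    return weight, status

cw = encode(0x5A)
print(decode(cw))                          # no fault
print(decode(cw ^ (1 << 6)))               # one flipped bit: corrected
print(decode(cw ^ (1 << 6) ^ (1 << 10)))   # two flipped bits: masked to zero
```

Masking to zero rather than keeping a corrupted value presumably limits how far a large erroneous weight can distort the output. The paper's own rationale lies beyond the truncated abstract.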
Taxonomy
Topics: Radiation Effects in Electronics · Smart Grid Security and Resilience · Software System Performance and Reliability
