Effective and Memory-Efficient Alternatives to ECC for Reliable Large-Scale DNNs
Mohammad Hasan Ahmadilivani, Marten Roots, Marco Restifo, Sven-Markus Loorits, Luca Di Mauro, Jaan Raik

TL;DR
This paper introduces two lightweight, memory-efficient methods, MSET and CEP, that significantly improve the reliability of large-scale deep neural networks in safety-critical applications, mostly outperforming conventional ECC schemes.
Contribution
The authors propose novel ECC alternatives, MSET and CEP, which enhance DNN reliability with lower area and delay overheads compared to conventional ECC methods.
Findings
Both methods mostly outperform SECDED ECC in reliability.
ViTs can be protected by safeguarding the high-order exponent bits of their parameters.
CEP achieves up to 10x higher bit error rate (BER) resilience, with lower area and faster decoding.
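To see why exponent bits are a natural target for selective protection, the sketch below flips individual bits in the IEEE 754 float32 encoding of a single weight. It is only an illustration: the float32 format, the example weight of 0.5, and the helper name `flip_bit` are assumptions made here, not details taken from the paper.

```python
import struct


def flip_bit(value, bit):
    """Flip one bit of the float32 encoding of `value`.

    Bit 31 is the sign, bits 30..23 are the exponent, bits 22..0 the mantissa.
    `flip_bit` is a hypothetical helper written for this illustration.
    """
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    bits ^= 1 << bit
    (flipped,) = struct.unpack("<f", struct.pack("<I", bits))
    return flipped


weight = 0.5  # assumed example weight; exactly representable in float32

for bit in (30, 23, 22, 0):
    print(f"bit {bit:2d}: {weight} -> {flip_bit(weight, bit)!r}")
```

Flipping bit 30, the most significant exponent bit, turns 0.5 into 2^127 (about 1.7e38), while flipping the lowest mantissa bit changes the weight by roughly 6e-8. A single upset in a high exponent bit can therefore dominate every computation the weight takes part in, whereas mantissa upsets are usually small perturbations.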
Abstract
Modern Deep Learning (DL) workloads are increasingly deployed in safety-critical domains, such as automotive systems and hyperscale data centers, where transient hardware faults pose a serious threat to system reliability. These workloads are highly memory-intensive, and their correct functionality strongly depends on model parameters stored in memory, which are typically protected using Error Correction Codes (ECCs). In this work, we study ECC's impact on such models and propose two lightweight alternatives to ECCs that achieve superior reliability. The first approach, MSET, selectively hardens the most vulnerable bits in CNN and ViT parameters, while the second approach, CEP, provides fine-grained protection for all parameter bits. Experimental results demonstrate that both methods significantly enhance the reliability of large CNNs and ViTs, mostly outperforming conventional Single…
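For readers unfamiliar with the SECDED baseline named in the findings, here is a minimal sketch of a textbook extended Hamming(8,4) code, which corrects any single bit error and detects any double bit error in a codeword. It is not the paper's method and not the code used in real memories (which typically protect wider words, commonly 64 data bits with 8 check bits); the function names `encode` and `decode` are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0: overall parity; indices 1..7: Hamming positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # makes the total parity even
    return code


def decode(code):
    """Return (nibble, status); nibble is None when a double error is detected."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    parity_odd = sum(code) % 2 == 1
    if parity_odd:
        # Single error: the syndrome is its position (0 means the parity bit).
        code[syndrome] ^= 1
        status = "corrected"
    elif syndrome != 0:
        # Parity is even but the syndrome is non-zero: two bits were flipped.
        return None, "double error detected"
    else:
        status = "ok"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1011)
word[6] ^= 1         # inject a single bit flip
print(decode(word))  # the flip is corrected and the original data recovered
```

Note the limit of this scheme: three or more flips in one codeword can be silently miscorrected, so SECDED protection degrades once multi-bit errors within a word become likely.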
