Algorithmic Strategies for Sustainable Reuse of Neural Network Accelerators with Permanent Faults
Youssef A. Ait Alama, Sampada Sakpal, Ke Wang, Razvan Bunescu, Avinash Karanth, and Ahmed Louri

TL;DR
This paper introduces algorithmic methods for sustainably reusing neural network accelerators that have permanent hardware faults. The methods incorporate the behavior of the faulty components instead of bypassing them, require no hardware modifications, and maintain high accuracy.
Contribution
It presents novel fault mitigation algorithms that utilize existing hardware features to tolerate permanent faults in systolic array-based NN accelerators.
Findings
Fault-tolerant techniques maintain near fault-free accuracy.
Methods work without hardware modifications.
Effective on various neural network architectures and datasets.
Abstract
Hardware failures are a growing challenge for machine learning accelerators, many of which are based on systolic arrays. When a permanent hardware failure occurs in a systolic array, existing solutions include localizing and isolating the faulty processing element (PE), using a redundant PE for re-execution, or in some extreme cases decommissioning the entire accelerator for further investigation. In this paper, we propose novel algorithmic approaches that mitigate permanent hardware faults in neural network (NN) accelerators by uniquely integrating the behavior of the faulty component instead of bypassing it. In doing so, we aim for a more sustainable use of the accelerator where faulty hardware is neither bypassed nor discarded, instead being given a second life. We first introduce a CUDA-accelerated systolic array simulator in PyTorch, which enabled us to quantify the impact of…
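To make the setting concrete, the following is a minimal functional sketch of how a single permanently faulty PE can corrupt the output of a systolic array. It is not the authors' CUDA-accelerated PyTorch simulator. The function name systolic_matvec, the weight-stationary dataflow, and the fault model (one bit of a PE's partial-sum output stuck at 1) are all assumptions chosen for illustration.

```python
def systolic_matvec(x, W, fault=None):
    """Functional model of a weight-stationary systolic array computing y = x @ W.

    PE (k, n) holds weight W[k][n]; partial sums flow down each column n.
    `fault` is a hypothetical (row, col, bit) tuple: the partial-sum output
    of that PE has the given bit stuck at 1 (an assumed fault model).
    """
    rows, cols = len(W), len(W[0])
    y = []
    for n in range(cols):
        psum = 0
        for k in range(rows):
            # Each PE adds its product to the partial sum arriving from above.
            psum = psum + x[k] * W[k][n]
            # The faulty PE forces one bit of its outgoing partial sum to 1.
            if fault is not None and (k, n) == (fault[0], fault[1]):
                psum |= 1 << fault[2]
        y.append(psum)
    return y


x = [1, 2, 3]
W = [[1, 0],
     [0, 1],
     [2, 2]]

clean = systolic_matvec(x, W)                      # [7, 8]
faulty = systolic_matvec(x, W, fault=(1, 0, 3))    # [15, 8]
print(clean, faulty)
```

In this sketch the fault corrupts only the column that contains the faulty PE, and the error is a deterministic function of the values passing through that PE. Deterministic behavior of this kind is what makes it plausible to account for a faulty component algorithmically instead of bypassing it.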
Taxonomy
Topics: Fault Detection and Control Systems · Advanced Data Processing Techniques
