Making Convolutions Resilient via Algorithm-Based Error Detection Techniques
Siva Kumar Sastry Hari, Michael B. Sullivan, Timothy Tsai, and Stephen W. Keckler

TL;DR
This paper investigates the use of checksum-based algorithmic error detection techniques to improve the resilience of CNN convolutions against hardware faults, achieving low overhead and high error coverage.
Contribution
It introduces practical ABED methods for CNN convolutions on optimized GPU platforms, overcoming implementation challenges and demonstrating effective error detection with minimal performance impact.
Findings
ABED detects all transient hardware errors in convolutions
Runtime overhead of ABED is between 6% and 23%
ABED achieves at least 1.6X throughput compared to full duplication
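The overhead and throughput findings above can be related with a short calculation. This is only a sketch under two assumptions that are not stated in the summary: full duplication costs exactly 2x the baseline runtime (the 100% overhead mentioned in the abstract), and throughput is the inverse of runtime. The function name `throughput_vs_duplication` is hypothetical.

```python
def throughput_vs_duplication(abed_overhead, duplication_overhead=1.0):
    """Relative throughput of ABED versus full duplication.

    Both overheads are fractions of the unprotected runtime
    (0.23 means 23%). Assumes throughput = 1 / runtime.
    """
    duplication_runtime = 1.0 + duplication_overhead
    abed_runtime = 1.0 + abed_overhead
    return duplication_runtime / abed_runtime


# Reported ABED overhead range: 6% to 23%.
for overhead in (0.06, 0.23):
    ratio = throughput_vs_duplication(overhead)
    print(f"overhead {overhead:.0%}: {ratio:.2f}x throughput vs. duplication")
# overhead 6%: 1.89x throughput vs. duplication
# overhead 23%: 1.63x throughput vs. duplication
```

Under these assumptions, the worst-case 23% overhead corresponds to roughly 1.63x, which is consistent with the "at least 1.6X" figure.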
Abstract
The ability of Convolutional Neural Networks (CNNs) to accurately process real-time telemetry has boosted their use in safety-critical and high-performance computing systems. As such systems require high levels of resilience to errors, CNNs must execute correctly in the presence of hardware faults. Full duplication provides the needed assurance but incurs a prohibitive 100% overhead. Algorithmic techniques are known to offer low-cost solutions, but the practical feasibility and performance of such techniques have never been studied for CNN deployment platforms (e.g., TensorFlow or TensorRT on GPUs). In this paper, we focus on algorithmically verifying convolutions, which are the most resource-demanding operations in CNNs. We use checksums to verify convolutions, adding a small amount of redundancy, far less than full duplication. We first identify the challenges that arise in employing…
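To illustrate the general idea of checksum-based verification of a convolution, here is a minimal NumPy sketch. It is not the paper's GPU implementation, which this excerpt does not show. It relies only on the linearity of convolution: summing the outputs over all output channels must equal the convolution of the input with the sum of the filters. The names `conv2d` and `verify_conv` are hypothetical, and integer data is assumed so that the comparison is exact; floating-point data would need a tolerance.

```python
import numpy as np


def conv2d(x, w):
    """Naive 'valid' cross-correlation.

    x: input of shape (C, H, W)
    w: filters of shape (K, C, R, S)
    returns: output of shape (K, H-R+1, W-S+1)
    """
    C, H, W = x.shape
    K, _, R, S = w.shape
    out_h, out_w = H - R + 1, W - S + 1
    out = np.zeros((K, out_h, out_w), dtype=x.dtype)
    for i in range(R):
        for j in range(S):
            patch = x[:, i:i + out_h, j:j + out_w]  # (C, out_h, out_w)
            out += np.einsum("kc,chw->khw", w[:, :, i, j], patch)
    return out


def verify_conv(x, w, y):
    """Check y = conv2d(x, w) using a filter checksum.

    The checksum filter is the sum of all K filters, so verification
    costs roughly one extra filter's worth of work instead of K.
    """
    filter_checksum = w.sum(axis=0, keepdims=True)  # (1, C, R, S)
    expected = conv2d(x, filter_checksum)[0]        # (out_h, out_w)
    actual = y.sum(axis=0)                          # output checksum
    return np.array_equal(expected, actual)


rng = np.random.default_rng(0)
x = rng.integers(-8, 8, size=(3, 6, 6))
w = rng.integers(-8, 8, size=(4, 3, 3, 3))

y = conv2d(x, w)
clean_ok = verify_conv(x, w, y)

# Emulate a transient fault by flipping one bit of one output value.
y_faulty = y.copy()
y_faulty[2, 1, 1] ^= 1 << 4
faulty_ok = verify_conv(x, w, y_faulty)

print(clean_ok, faulty_ok)  # True False
```

The redundancy here is one additional filter plus two reductions, which is why a checksum approach can cost far less than running the whole convolution twice.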
