Making Convolutions Resilient via Algorithm-Based Error Detection Techniques

Siva Kumar Sastry Hari, Michael B. Sullivan, Timothy Tsai, and Stephen W. Keckler

arXiv:2006.04984·cs.DC·June 11, 2020

TL;DR

This paper investigates the use of checksum-based algorithmic error detection techniques to improve the resilience of CNN convolutions against hardware faults, achieving low overhead and high error coverage.

Contribution

It introduces practical ABED methods for CNN convolutions on optimized GPU platforms, overcoming implementation challenges and demonstrating effective error detection with minimal performance impact.

Findings

01. ABED detects all transient hardware errors in convolutions

02. Runtime overhead of ABED is between 6-23%

03. ABED achieves at least 1.6X throughput compared to full duplication

Abstract

The ability of Convolutional Neural Networks (CNNs) to accurately process real-time telemetry has boosted their use in safety-critical and high-performance computing systems. As such systems require high levels of resilience to errors, CNNs must execute correctly in the presence of hardware faults. Full duplication provides the needed assurance but incurs a prohibitive 100% overhead. Algorithmic techniques are known to offer low-cost solutions, but the practical feasibility and performance of such techniques have never been studied for CNN deployment platforms (e.g., TensorFlow or TensorRT on GPUs). In this paper, we focus on algorithmically verifying Convolutions, which are the most resource-demanding operations in CNNs. We use checksums to verify convolutions, adding a small amount of redundancy, far less than full-duplication. We first identify the challenges that arise in employing…
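To make the checksum idea concrete, here is a minimal sketch of one common way a convolution can be verified algorithmically. It relies only on linearity: summing the K output maps pixel by pixel must give the same result as convolving the input with the element-wise sum of the K filters. This is an illustration under stated assumptions, not necessarily the exact scheme the paper implements; the function names (`conv2d`, `filter_checksum`, `verify`), the single-channel input, and the tolerance `tol` are all hypothetical choices made for the example.

```python
import random

def conv2d(x, w):
    """Valid 2D cross-correlation of one channel x (H x W) with one filter w (R x S)."""
    H, W, R, S = len(x), len(x[0]), len(w), len(w[0])
    return [[sum(x[i + r][j + s] * w[r][s] for r in range(R) for s in range(S))
             for j in range(W - S + 1)]
            for i in range(H - R + 1)]

def filter_checksum(filters):
    """Element-wise sum of all K filters, giving one extra 'checksum filter'."""
    R, S = len(filters[0]), len(filters[0][0])
    return [[sum(f[r][s] for f in filters) for s in range(S)] for r in range(R)]

def verify(x, filters, outputs, tol=1e-6):
    """Check that the per-pixel sum of the K output maps equals conv(x, checksum filter).

    tol is an assumed relative tolerance to absorb floating-point rounding.
    """
    expected = conv2d(x, filter_checksum(filters))
    for i, row in enumerate(expected):
        for j, e in enumerate(row):
            observed = sum(out[i][j] for out in outputs)
            if abs(observed - e) > tol * max(1.0, abs(e)):
                return False
    return True

random.seed(0)
x = [[random.uniform(-1, 1) for _ in range(6)] for _ in range(6)]
filters = [[[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
           for _ in range(4)]

outputs = [conv2d(x, f) for f in filters]
clean_ok = verify(x, filters, outputs)    # fault-free run

outputs[2][1][3] += 0.5                   # simulate a corrupted output value
faulty_ok = verify(x, filters, outputs)   # the checksum no longer matches

print(clean_ok, faulty_ok)
```

In this sketch the check costs roughly one additional filter's worth of convolution plus a reduction over the outputs, instead of recomputing all K filters as duplication would. A floating-point implementation needs a tolerance for rounding, so very small corruptions could fall below it; the tolerance used here is arbitrary.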
