Experimental Findings on the Sources of Detected Unrecoverable Errors in GPUs
Fernando Fernandes dos Santos, Sujit Malde, Carlo Cazzaniga, Christopher Frost, Luigi Carro, Paolo Rech

TL;DR
This paper investigates the sources of Detected Unrecoverable Errors in GPUs exposed to neutron beams, identifying key causes and evaluating the impact of ECC on error reduction.
Contribution
It provides empirical data on error sources in GPUs under neutron exposure and quantifies the effectiveness of ECC in reducing specific error types.
Findings
ECC reduces DUEs caused by Illegal Address access by up to 92% in Kepler GPUs.
ECC reduces DUEs caused by Illegal Address access by up to 98% in Volta GPUs.
Illegal memory accesses and interface errors are significant sources of DUEs.
Abstract
We investigate the sources of Detected Unrecoverable Errors (DUEs) in GPUs exposed to neutron beams. Illegal memory accesses and interface errors are among the more likely sources of DUEs. ECC increases the number of launch failure events. Our test procedure shows that ECC can reduce the DUEs caused by Illegal Address access by up to 92% for Kepler and 98% for Volta.
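The percentage reductions quoted above are relative reductions in DUE rate between ECC-off and ECC-on runs. A minimal sketch of that arithmetic follows, assuming the rates are expressed in the same units for both configurations; the function name `due_reduction_percent` and the input values are hypothetical and are not measurements from the paper.

```python
def due_reduction_percent(rate_ecc_off: float, rate_ecc_on: float) -> float:
    """Relative reduction in DUE rate when ECC is enabled, in percent.

    A negative result means the rate went up with ECC enabled.
    """
    if rate_ecc_off <= 0:
        raise ValueError("rate_ecc_off must be positive")
    return 100.0 * (rate_ecc_off - rate_ecc_on) / rate_ecc_off


# Hypothetical rates for illustration only (arbitrary units).
illegal_address = due_reduction_percent(200.0, 50.0)
launch_failure = due_reduction_percent(10.0, 15.0)
print(f"Illegal Address DUE change: {illegal_address:.1f}% reduction")
print(f"Launch failure DUE change: {launch_failure:.1f}% reduction")
```

A negative value from the same formula corresponds to an error class that becomes more frequent with ECC enabled, which is how an increase in launch failure events would show up.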
Taxonomy
Topics: Radiation Effects in Electronics · Parallel Computing and Optimization Techniques · Nuclear reactor physics and engineering
