Understanding the Effects of Permanent Faults in GPU's Parallelism Management and Control Units
Juan-David Guerrero-Balaguera (DAUIN), Josie E. Rodriguez Condia (DAUIN), Fernando F. dos Santos (TARAN), Matteo Sonza (DAUIN), Paolo Rech

TL;DR
This paper presents a method to evaluate the impact of permanent faults in GPU scheduler and control units, providing quantification and software mapping of errors affecting GPU reliability during high-performance computing and neural network tasks.
Contribution
It introduces a novel fault injection approach that significantly reduces evaluation time and characterizes the effects of permanent faults in GPU control units on software execution.
Findings
Up to 99% of permanent errors impact software execution.
Faults can modify opcodes, addresses, and thread statuses.
45% of errors cause silent data corruptions.
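To make the finding that faults can modify opcodes more concrete, the following is a minimal sketch of a stuck-at fault forced onto one bit of an opcode field. It is not the paper's injection framework: the 4-bit opcode table, the function names, and the choice of a single stuck-at bit are all assumptions made for illustration. Because the fault is permanent, it is applied to every instruction in the stream, and each instruction is either unaffected, silently turned into another valid opcode, or turned into an invalid one.

```python
# Hypothetical 4-bit opcode encoding, invented for this illustration only.
OPCODES = {0b0000: "NOP", 0b0001: "ADD", 0b0011: "MUL", 0b0101: "LOAD"}


def apply_stuck_at(value: int, bit: int, stuck_value: int) -> int:
    """Force one bit of `value` to 0 or 1, modelling a permanent stuck-at fault."""
    if stuck_value not in (0, 1):
        raise ValueError("stuck_value must be 0 or 1")
    mask = 1 << bit
    return (value | mask) if stuck_value else (value & ~mask)


def decode(opcode: int) -> str:
    """Map an encoded opcode to its mnemonic, or INVALID if it is not defined."""
    return OPCODES.get(opcode, "INVALID")


def classify(golden: int, faulty: int) -> str:
    """Compare the fault-free and faulty encodings of one instruction."""
    if faulty == golden:
        return "masked"           # the bit already had the stuck value
    if decode(faulty) == "INVALID":
        return "invalid opcode"   # the fault produced an undefined encoding
    return "opcode changed"       # a different, still valid instruction


def inject(stream, bit, stuck_value):
    """Apply the same permanent fault to every instruction in the stream."""
    results = []
    for golden in stream:
        faulty = apply_stuck_at(golden, bit, stuck_value)
        results.append((decode(golden), decode(faulty), classify(golden, faulty)))
    return results


if __name__ == "__main__":
    # Bit 1 of the opcode field stuck at 1, applied to ADD, MUL, LOAD.
    for row in inject([0b0001, 0b0011, 0b0101], bit=1, stuck_value=1):
        print(row)
```

In this toy stream the same fault is masked for one instruction, silently turns ADD into MUL for another, and yields an undefined encoding for the third, which is why a single permanent fault can show up as several different software-level effects.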
Abstract
Graphics Processing Units (GPUs) are over-stressed to accelerate High-Performance Computing applications and are used to accelerate Deep Neural Networks in several domains where they have a life expectancy of many years. These conditions expose the GPU hardware to (premature) aging, causing permanent faults to arise after the usual end-of-manufacturing test. Techniques to assess the impact of permanent faults in GPUs are therefore strongly required, so that the reliability risk can be estimated and, possibly, mitigated. In this paper, we present a method to evaluate the effects of permanent faults affecting the GPU scheduler and control units, which are the most peculiar and stressed resources, along with the first figures that allow quantifying these effects. We characterize over 5.83 × 10^5 permanent fault effects in the scheduler and controllers of a gate-level GPU model. Then, we map…
Taxonomy
Topics: Radiation Effects in Electronics · Parallel Computing and Optimization Techniques · Advanced Memory and Neural Computing
