Hard Data on Soft Errors: A Large-Scale Assessment of Real-World Error Rates in GPGPU
Imran S. Haque, Vijay S. Pande

TL;DR
This study assesses real-world memory error rates in GPUs. About two-thirds of the tested GPUs show pattern-sensitive soft errors, which persist after controlling for overclocking and temperature but depend strongly on GPU architecture.
Contribution
The paper introduces MemtestG80, a tool for measuring GPU memory errors, and provides the first large-scale assessment of error rates in real-world GPU deployments.
Findings
Two-thirds of tested GPUs exhibit detectable memory soft errors.
Errors are pattern-sensitive and persist after controlling for overclocking and temperature.
Error rates depend strongly on GPU architecture.
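To make "pattern-sensitive" concrete, here is a minimal host-side sketch of a write-then-verify pattern test. It is not MemtestG80's code and does not touch a GPU: the buffer is an ordinary Python list, the walking-ones patterns are a generic choice, and `stuck_bit_fault` is a hypothetical injected fault used only to show how an error can appear under some patterns and not others.

```python
WORD_MASK = 0xFFFFFFFF  # treat each buffer entry as a 32-bit word


def walking_ones_patterns():
    """Yield 32 patterns, each with exactly one bit set."""
    for bit in range(32):
        yield 1 << bit


def write_verify(buffer, pattern, fault=None):
    """Fill the buffer with pattern, optionally inject a fault,
    then return the number of words that no longer match."""
    for i in range(len(buffer)):
        buffer[i] = pattern
    if fault is not None:
        fault(buffer)  # simulated corruption between write and read
    return sum(1 for word in buffer if word != pattern)


def run_pattern_test(n_words=1024, fault=None):
    """Run every walking-ones pattern and its bitwise complement.
    Returns {pattern: mismatched_word_count} for failing patterns only."""
    buffer = [0] * n_words
    errors = {}
    for pattern in walking_ones_patterns():
        for p in (pattern, pattern ^ WORD_MASK):
            count = write_verify(buffer, p, fault)
            if count:
                errors[p] = count
    return errors


def stuck_bit_fault(buffer):
    """Hypothetical fault: bit 5 of word 7 always reads back as 0."""
    buffer[7] &= ~(1 << 5) & WORD_MASK


if __name__ == "__main__":
    clean = run_pattern_test()
    faulty = run_pattern_test(fault=stuck_bit_fault)
    print("failing patterns without fault:", len(clean))
    print("failing patterns with fault:", len(faulty))
```

In this sketch the injected fault is visible only under patterns that set bit 5, so a test that wrote a single all-zeros pattern would miss it. That is the sense in which an error is pattern-sensitive.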
Abstract
Graphics processing units (GPUs) are gaining widespread use in computational chemistry and other scientific simulation contexts because of their huge performance advantages relative to conventional CPUs. However, the reliability of GPUs in error-intolerant applications is largely unproven. In particular, a lack of error checking and correcting (ECC) capability in the memory subsystems of graphics cards has been cited as a hindrance to the acceptance of GPUs as high-performance coprocessors, but the impact of this design has not been previously quantified. In this article we present MemtestG80, our software for assessing memory error rates on NVIDIA G80 and GT200-architecture-based graphics cards. Furthermore, we present the results of a large-scale assessment of GPU error rate, conducted by running MemtestG80 on over 20,000 hosts on the Folding@home distributed computing network. Our…
Taxonomy
Topics: Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques · Software System Performance and Reliability
