DRAM Errors and Cosmic Rays: Space Invaders or Science Fiction?
Isaac Boixaderas, Jorge Amaya, Sergi Moré, Javier Bartolome, David Vicente, Osman Unsal, Dimitris Gizopoulos, Paul M. Carpenter, Petar Radojković, Eduard Ayguadé

TL;DR
This study investigates the potential link between cosmic rays and DRAM errors in HPC systems. It finds no evidence of cosmic-ray influence in the two analyzed clusters, challenging the widely accepted assumption that cosmic rays are a plausible cause of such errors.
Contribution
The paper provides a large-scale empirical analysis of cosmic ray effects on DRAM errors in production HPC environments, combining quantitative analysis, formal statistical methods and machine learning.
Findings
No correlation between cosmic rays and DRAM errors detected
Analysis covers 2000 billion MB-hours of error logs for the MareNostrum 3 supercomputer and 135 million MB-hours for the Mont-Blanc prototype
Results suggest cosmic rays may not significantly impact DRAM errors in studied environments
Abstract
It is widely accepted that cosmic rays are a plausible cause of DRAM errors in high-performance computing (HPC) systems, and various studies suggest that they could explain some aspects of the observed DRAM error behavior. However, this phenomenon is insufficiently studied in production environments. We analyze the correlations between cosmic rays and DRAM errors on two HPC clusters: a production supercomputer with server-class DDR3-1600 and a prototype with LPDDR3-1600 and no hardware error correction. Our error logs cover 2000 billion MB-hours for the MareNostrum 3 supercomputer and 135 million MB-hours for the Mont-Blanc prototype. Our analysis combines quantitative analysis, formal statistical methods and machine learning. We detect no indications that cosmic rays have any influence on the DRAM errors. To understand whether the findings are specific to systems under study, located…
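As a rough illustration of what testing for a correlation between cosmic ray intensity and DRAM errors can look like, the sketch below computes a Pearson coefficient between two daily series and estimates a p-value with a permutation test. This is not the authors' pipeline: the function names, the choice of Pearson correlation, the daily granularity and all numbers are assumptions made for this example only.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # undefined for a constant series; treated as 0 here
    return cov / math.sqrt(var_x * var_y)


def permutation_p_value(xs, ys, n_perm=2000, seed=0):
    """Two-sided p-value: share of shuffles with |r| at least the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical daily values, invented for illustration only.
neutron_counts = [6510, 6498, 6532, 6475, 6520, 6489, 6541, 6503, 6517, 6494]
dram_errors = [3, 7, 2, 5, 4, 6, 3, 8, 2, 5]

r = pearson(neutron_counts, dram_errors)
p = permutation_p_value(neutron_counts, dram_errors)
print(f"r = {r:.3f}, permutation p-value = {p:.3f}")
```

A large p-value here would mean that the observed correlation is indistinguishable from what random reordering produces. Note that a plain shuffle ignores any autocorrelation in the series, so a real analysis would likely need a more careful null model.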
