Reinforcement Learning-based Adaptive Mitigation of Uncorrected DRAM Errors in the Field
Isaac Boixaderas, Sergi Moré, Javier Bartolome, David Vicente, Petar Radojković, Paul M. Carpenter, Eduard Ayguadé

TL;DR
This paper introduces a reinforcement learning-based adaptive approach to mitigate uncorrected DRAM errors, significantly reducing lost compute time in supercomputers by predicting and acting on error likelihoods.
Contribution
It presents the first adaptive, RL-based method for uncorrected DRAM error mitigation, optimizing mitigation timing based on error prediction and cost-benefit analysis.
Findings
Reduces lost compute time by 54% on supercomputer logs
Achieves performance within 6% of an optimal Oracle method
Uses only two user-defined parameters for mitigation decision-making
Abstract
Scaling to larger systems, with current levels of reliability, requires cost-effective methods to mitigate hardware failures. One of the main causes of hardware failure is an uncorrected error in memory, which terminates the current job and wastes all computation since the last checkpoint. This paper presents the first adaptive method for triggering uncorrected error mitigation. It uses a prediction approach that considers the likelihood of an uncorrected error and its current potential cost. The method is based on reinforcement learning, and the only user-defined parameters are the mitigation cost and whether the job can be restarted from a mitigation point. We evaluate our method using classical machine learning metrics together with a cost-benefit analysis, which compares the cost of mitigation actions with the benefits from mitigating some of the errors. On two years of production…
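The cost-benefit reasoning in the abstract can be illustrated with a small Python sketch. This is not the paper's reinforcement learning method. It is a plain expected-cost threshold rule, shown only to make concrete the trade-off between the likelihood of an uncorrected error, its current potential cost, and the mitigation cost. The names `JobState`, `potential_cost`, `should_mitigate`, and `net_benefit` are hypothetical, as are the assumptions that costs are measured in node-hours and that a successful mitigation saves all the work at risk. The user parameter for whether the job can be restarted from a mitigation point is not modeled here.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class JobState:
    """Hypothetical snapshot of a running job (illustrative, not from the paper)."""
    nodes: int                     # nodes allocated to the job
    hours_since_safe_point: float  # time since the last checkpoint


def potential_cost(job: JobState) -> float:
    """Node-hours that would be lost if an uncorrected error occurred now (assumed unit)."""
    return job.nodes * job.hours_since_safe_point


def should_mitigate(p_error: float, job: JobState, mitigation_cost: float) -> bool:
    """Threshold rule: mitigate when the expected loss exceeds the mitigation cost."""
    return p_error * potential_cost(job) > mitigation_cost


def net_benefit(
    decisions: Iterable[Tuple[bool, bool, float]], mitigation_cost: float
) -> float:
    """Tally benefit minus cost over (mitigated, error_occurred, cost_at_risk) records.

    Assumption: a mitigation that precedes an error saves the full cost at risk;
    every mitigation, useful or not, pays mitigation_cost.
    """
    total = 0.0
    for mitigated, error_occurred, cost_at_risk in decisions:
        if mitigated:
            total -= mitigation_cost
            if error_occurred:
                total += cost_at_risk
    return total


job = JobState(nodes=64, hours_since_safe_point=2.0)
decide_high = should_mitigate(0.05, job, mitigation_cost=2.0)
decide_low = should_mitigate(0.01, job, mitigation_cost=2.0)
```

In this sketch a large job far from its last checkpoint is mitigated at a lower error probability than a small or freshly checkpointed one, which is the sense in which the trigger adapts to the current potential cost. A learned policy, as in the paper, would replace the fixed threshold rule.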
