Reinforcement Learning-based Adaptive Mitigation of Uncorrected DRAM Errors in the Field

Isaac Boixaderas; Sergi Moré; Javier Bartolome; David Vicente; Petar Radojković; Paul M. Carpenter; Eduard Ayguadé

arXiv:2407.16377·cs.DC·September 6, 2024

TL;DR

This paper introduces an adaptive, reinforcement learning-based approach to mitigating uncorrected DRAM errors. It decides when to trigger mitigation from the likelihood of an uncorrected error and its current potential cost, and it reduces lost compute time by 54% on supercomputer logs.

Contribution

The paper presents the first adaptive method for triggering uncorrected DRAM error mitigation. The method is based on reinforcement learning and times mitigation by weighing the likelihood of an uncorrected error against its current potential cost.

Findings

01 Reduces lost compute time by 54% on supercomputer logs

02 Achieves performance within 6% of an optimal Oracle method

03 Uses only two user-defined parameters: the mitigation cost and whether the job can be restarted from a mitigation point

Abstract

Scaling to larger systems, with current levels of reliability, requires cost-effective methods to mitigate hardware failures. One of the main causes of hardware failure is an uncorrected error in memory, which terminates the current job and wastes all computation since the last checkpoint. This paper presents the first adaptive method for triggering uncorrected error mitigation. It uses a prediction approach that considers the likelihood of an uncorrected error and its current potential cost. The method is based on reinforcement learning, and the only user-defined parameters are the mitigation cost and whether the job can be restarted from a mitigation point. We evaluate our method using classical machine learning metrics together with a cost-benefit analysis, which compares the cost of mitigation actions with the benefits from mitigating some of the errors. On two years of production…
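The trade-off described in the abstract, weighing the likelihood of an uncorrected error against its current potential cost, can be illustrated with a simple expected-cost threshold rule. This is a minimal sketch and not the paper's reinforcement learning method: the names `JobState`, `potential_cost`, `should_mitigate`, and `net_benefit` are hypothetical, the use of node-hours as the cost unit is an assumption, and the error probability is assumed to come from some external predictor.

```python
from dataclasses import dataclass


@dataclass
class JobState:
    """Hypothetical snapshot of a running job (not from the paper)."""
    nodes: int                     # nodes allocated to the job
    hours_since_checkpoint: float  # compute time at risk if the job dies now


def potential_cost(job: JobState) -> float:
    """Node-hours lost if an uncorrected error terminated the job right now."""
    return job.nodes * job.hours_since_checkpoint


def should_mitigate(p_error: float, job: JobState, mitigation_cost: float) -> bool:
    """Static stand-in for the learned policy: mitigate when the expected
    loss (probability times potential cost) exceeds the mitigation cost."""
    return p_error * potential_cost(job) > mitigation_cost


def net_benefit(decisions, mitigation_cost: float) -> float:
    """Cost-benefit tally over (mitigated, error_followed, at_risk) records.

    Assumption: a mitigation that precedes an error saves all node-hours
    at risk, and every mitigation costs `mitigation_cost` node-hours.
    """
    saved = 0.0
    spent = 0.0
    for mitigated, error_followed, at_risk in decisions:
        if mitigated:
            spent += mitigation_cost
            if error_followed:
                saved += at_risk
    return saved - spent


if __name__ == "__main__":
    job = JobState(nodes=64, hours_since_checkpoint=2.0)
    print(should_mitigate(0.01, job, mitigation_cost=1.0))
    history = [(True, True, 128.0), (True, False, 50.0), (False, True, 30.0)]
    print(net_benefit(history, mitigation_cost=1.0))
```

The same potential cost grows as a job runs further past its last checkpoint, so a fixed error probability can justify mitigation late in a checkpoint interval but not early in it. That time dependence is what makes an adaptive trigger attractive compared with a fixed schedule.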
