Memory Vulnerability: A Case for Delaying Error Reporting

Luc Jaulmes; Miquel Moretó; Mateo Valero; Marc Casas

arXiv:1810.06472 · cs.AR · August 2, 2023 · 1 citation

TL;DR

This paper formalizes and evaluates the Memory Vulnerability Factor (MVF) and introduces the False Error Aware MVF (FEA), demonstrating their effectiveness in accurately estimating memory error impact at runtime.

Contribution

It extends the MVF metric to account for false errors and demonstrates FEA's superior correlation with actual program outcome errors.

Findings

1. FEA provides a tighter upper bound on error probability than MVF.
2. Both MVF and FEA are safe for runtime use to estimate error impact.
3. FEA correlates best with the likelihood of incorrect program outcomes.

Abstract

To face future reliability challenges, it is necessary to quantify the risk of error in any part of a computing system. To this end, the Architectural Vulnerability Factor (AVF) has long been used for chips. However, this metric is used for offline characterisation, which is inappropriate for memory. We survey the literature and formalise one of the metrics used, the Memory Vulnerability Factor (MVF), and extend it to take into account false errors. These are reported errors which would have no impact on the program if they were ignored. We measure the False Error Aware MVF (FEA) and related metrics precisely in a cycle-accurate simulator, and compare them with the effects of injecting faults in a program's data, in native parallel runs. Our findings show that MVF and FEA are the only two metrics that are safe to use at runtime, as they both consistently give an upper bound on the…

Taxonomy

Topics: Radiation Effects in Electronics · Parallel Computing and Optimization Techniques · Distributed systems and fault tolerance