Doubt and Redundancy Kill Soft Errors -- Towards Detection and Correction of Silent Data Corruption in Task-based Numerical Software
Philipp Samfass, Tobias Weinzierl, Anne Reinarz, Michael Bader

TL;DR
This paper introduces a task-based soft error detection and correction scheme for high-performance numerical software that uses outcome error criteria and task redundancy to identify and fix silent data corruptions with minimal performance impact.
Contribution
It presents a self-healing, resilient algorithm that detects silent floating-point errors through per-task outcome error criteria and corrects them by recomputing dubious tasks redundantly, without significantly increasing runtime, memory footprint or I/O demands.
Findings
Effective detection of silent data corruption with minimal overhead
Redundant task execution improves error correction accuracy
Domain-specific tuning of error criteria enhances reliability
Abstract
Resilient algorithms in high-performance computing are subject to rigorous non-functional constraints. Resiliency must not increase the runtime, memory footprint or I/O demands too significantly. We propose a task-based soft error detection scheme that relies on error criteria per task outcome. They formalise how "dubious" an outcome is, i.e. how likely it contains an error. Our whole simulation is replicated once, forming two teams of MPI ranks that share their task results. Thus, ideally each team handles only around half of the workload. If a task yields large error criteria values, i.e. is dubious, we compute the task redundantly and compare the outcomes. Whenever they disagree, the task result with a lower error likeliness is accepted. We obtain a self-healing, resilient algorithm which can compensate silent floating-point errors without a significant performance, I/O or memory…
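The decision logic described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the actual error criteria, so `error_criterion` below is a hypothetical stand-in (non-finite values or large jumps between neighbouring values count as dubious), and the names `resilient_result`, `threshold` and `tol` are assumptions made for illustration. The exchange of task results between the two teams of MPI ranks is reduced to a plain callback.

```python
import math
from typing import Callable, List, Sequence


def error_criterion(values: Sequence[float]) -> float:
    """Hypothetical criterion: larger means more dubious.

    Non-finite entries give infinity; otherwise the largest jump
    between neighbouring entries is returned.
    """
    if any(not math.isfinite(v) for v in values):
        return math.inf
    return max((abs(b - a) for a, b in zip(values, values[1:])), default=0.0)


def resilient_result(
    outcome: Sequence[float],
    compute_redundantly: Callable[[], Sequence[float]],
    threshold: float,
    tol: float = 0.0,
) -> List[float]:
    """Accept a task outcome, recomputing it only if it looks dubious.

    `outcome` is the available task result (for example one shared by
    the other team); `compute_redundantly` recomputes the same task.
    """
    score = error_criterion(outcome)
    if score <= threshold:
        # Not dubious: trust the outcome and skip the redundant work.
        return list(outcome)

    # Dubious: compute the task redundantly and compare both outcomes.
    second = compute_redundantly()
    agree = len(outcome) == len(second) and all(
        abs(a - b) <= tol for a, b in zip(outcome, second)
    )
    if agree:
        return list(outcome)

    # The outcomes disagree: accept the one with the lower criterion value.
    if error_criterion(second) < score:
        return list(second)
    return list(outcome)


clean = [1.0, 1.1, 1.2]
corrupted = [1.0, 1.0e6, 1.2]  # imitates a bit flip in one entry
healed = resilient_result(corrupted, lambda: clean, threshold=1.0)
```

In this sketch, redundant work is only triggered for outcomes whose criterion exceeds `threshold`, which mirrors the idea that each team ideally handles only around half of the workload. How the threshold is chosen is left open here.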
