# Detection of Silent Data Corruptions in Smoothed Particle Hydrodynamics Simulations

**Authors:** Aurélien Cavelan, Rubén M. Cabezón, Florina M. Ciorba

arXiv: 1904.10221 · 2020-05-19

## TL;DR

This paper introduces Selective Particle Replication (SPR), a novel method for detecting silent data corruptions in Smoothed Particle Hydrodynamics simulations, achieving high detection rates with minimal overhead.

## Contribution

SPR is the first particle-based replication method specifically designed to detect silent data corruptions in SPH simulations, improving reliability in large-scale scientific computing.

## Key findings

- SPR detects 91-99.9% of SDCs with no false positives.
- SPR incurs 1-10% overhead in HPC environments.
- SPR catches many DRAM SDCs as they propagate between particles, because every particle has at least one replicated neighbor.

## Abstract

Silent data corruptions (SDCs) hinder the correctness of long-running scientific applications on large-scale computing systems. Selective particle replication (SPR) is proposed herein as the first particle-based replication method for detecting SDCs in Smoothed Particle Hydrodynamics (SPH) simulations. SPH is a mesh-free Lagrangian method commonly used to perform hydrodynamical simulations in astrophysics and computational fluid dynamics. SPH performs interpolation of physical properties over neighboring discretization points (called SPH particles) that dynamically adapt their distribution to the mass density field of the fluid. When a fault (e.g., a bit-flip) strikes the computation or the data associated with a particle, the resulting error is silently propagated to all nearest neighbors through such interpolation steps. SPR replicates the computation and data of a few carefully selected SPH particles. SDCs are detected when the data of a particle differs, due to corruption, from its replicated counterpart. SPR is able to detect many DRAM SDCs as they propagate by ensuring that all particles have at least one neighbor that is replicated. The detection capabilities of SPR were assessed through a set of error-injection and detection experiments, and the overhead of SPR was evaluated via a set of strong-scaling experiments conducted on an HPC system. The results show that SPR achieves detection rates of 91-99.9%, with no false positives, at an overhead of 1-10%.
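The selection and comparison ideas in the abstract can be illustrated with a small toy model. The sketch below is not the authors' implementation: the greedy selection rule, the neighbor-averaging `interpolate` step, and all function names are assumptions made for illustration, and a replicated particle is treated here as covered by itself. It shows how a small set of replicated particles can cover every particle's neighborhood, and how a corruption in a non-replicated particle could surface as a mismatch in its replicated neighbors.

```python
from typing import Dict, List, Set

Neighbors = Dict[int, List[int]]


def select_replicas(neighbors: Neighbors) -> Set[int]:
    """Greedy cover (illustrative assumption, not the paper's algorithm):
    pick particles until every particle is replicated or has at least
    one replicated neighbor."""
    uncovered = set(neighbors)
    replicas: Set[int] = set()
    while uncovered:
        # Choose the particle whose neighborhood covers the most
        # still-uncovered particles; ties go to the lowest id.
        best = max(
            sorted(neighbors),
            key=lambda p: len(({p} | set(neighbors[p])) & uncovered),
        )
        replicas.add(best)
        uncovered -= {best} | set(neighbors[best])
    return replicas


def interpolate(values: Dict[int, float], neighbors: Neighbors, p: int) -> float:
    """Toy stand-in for an SPH interpolation step: the mean of the
    neighbor values. Assumes particle p has at least one neighbor."""
    return sum(values[q] for q in neighbors[p]) / len(neighbors[p])


def detect_sdc(primary: Dict[int, float], replica: Dict[int, float]) -> List[int]:
    """Return ids of replicated particles whose primary result differs
    from the replicated one. The comparison is exact, on the assumption
    that both computations are deterministic."""
    return sorted(p for p in replica if primary[p] != replica[p])


# Hypothetical 1D chain of five particles: 0 - 1 - 2 - 3 - 4.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
values = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0, 4: 5.0}

replicas = select_replicas(neighbors)

# Replicated computation, performed before the simulated fault.
replica_out = {p: interpolate(values, neighbors, p) for p in replicas}

# Simulated corruption of the non-replicated particle 2.
values[2] = 6.0

# Primary computation reads the corrupted value through interpolation.
primary_out = {p: interpolate(values, neighbors, p) for p in neighbors}

flagged = detect_sdc(primary_out, replica_out)
```

In this toy chain, particle 2 is not replicated, yet its corrupted value reaches both replicated neighbors through the interpolation step, so the mismatch is flagged there. Finding a minimum such cover is a dominating-set problem, so a greedy heuristic is used here purely for brevity.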

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1904.10221/full.md

## Figures

15 figures with captions in the complete paper: https://tomesphere.com/paper/1904.10221/full.md

## References

51 references — full list in the complete paper: https://tomesphere.com/paper/1904.10221/full.md

---
Source: https://tomesphere.com/paper/1904.10221