Fault-Tolerant Adaptive Parallel and Distributed Simulation
Gabriele D'Angelo, Stefano Ferretti, Moreno Marzolla, Lorenzo Armaroli

TL;DR
This paper introduces FT-GAIA, a fault-tolerant extension of a parallel simulation middleware, which replicates simulation entities and messages to enable robust simulation on large, failure-prone HPC systems.
Contribution
The paper presents FT-GAIA, a software-based fault-tolerant extension for parallel simulation middleware that handles crash and Byzantine failures through replication and synchronization.
Findings
High fault tolerance achieved with moderate computational overhead
Effective crash-failure tolerance through entity replication
Protection against Byzantine failures via message replication
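As a rough illustration of the replication idea in the findings above, the sketch below uses the textbook replication bounds: f + 1 replicas to survive f crash failures, and 2f + 1 replicas with majority voting to mask f Byzantine failures. These bounds, and the function names `replicas_needed` and `accept_message`, are assumptions made for this example. They are not taken from the FT-GAIA code base.

```python
from collections import Counter

def replicas_needed(f, byzantine):
    """Replicas per entity needed to tolerate f faults (standard bounds, assumed here)."""
    return 2 * f + 1 if byzantine else f + 1

def accept_message(copies, f):
    """Return the payload reported by at least f + 1 replicas, or None.

    `copies` holds the payloads received from the replicas of one sender.
    With at most f Byzantine replicas, any payload seen f + 1 times must
    have been sent by at least one correct replica.
    """
    if not copies:
        return None
    payload, count = Counter(copies).most_common(1)[0]
    return payload if count >= f + 1 else None

# Hypothetical usage: three replicas, one of which sends a corrupted payload.
result = accept_message(["event@t=5", "event@t=5", "garbage"], f=1)
```

In the crash-failure case no voting is needed, since a crashed replica sends nothing rather than a wrong value. A receiver can therefore take the first copy that arrives and discard the duplicates.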
Abstract
Discrete Event Simulation is a widely used technique for modeling and analyzing complex systems in many fields of science and engineering. The increasingly large size of simulation models poses a serious computational challenge, since the time needed to run a simulation can be prohibitively large. For this reason, Parallel and Distributed Simulation techniques have been proposed to take advantage of the multiple execution units found in multicore processors, clusters of workstations, or HPC systems. The current generation of HPC systems includes hundreds of thousands of computing nodes and a vast number of ancillary components. Despite improvements in manufacturing processes, failures of some components are frequent, and the situation will get worse as larger systems are built. In this paper we describe FT-GAIA, a software-based fault-tolerant extension of the GAIA/ARTÌS…
