Fault Tolerant Adaptive Parallel and Distributed Simulation through Functional Replication
Gabriele D'Angelo, Stefano Ferretti, Moreno Marzolla

TL;DR
FT-GAIA is a middleware that adds fault tolerance to parallel and distributed simulations through functional replication of entities and messages, tolerating crash failures and offering some protection against Byzantine failures with moderate overhead.
Contribution
This work introduces FT-GAIA, a novel fault-tolerant middleware for PADS that employs functional replication to handle crash and Byzantine failures.
Findings
High fault tolerance achieved with moderate computational overhead.
Effective handling of crash failures through entity replication.
Protection against Byzantine failures via message replication.
Abstract
This paper presents FT-GAIA, a software-based fault-tolerant parallel and distributed simulation middleware. FT-GAIA has been designed to reliably handle Parallel And Distributed Simulation (PADS) models, which are needed to properly simulate and analyze complex systems arising in any kind of scientific or engineering field. PADS takes advantage of multiple execution units running on multicore processors, clusters of workstations, or HPC systems. However, large computing systems, such as HPC systems that include hundreds of thousands of computing nodes, have to handle frequent failures of some components. To cope with this issue, FT-GAIA transparently replicates simulation entities and distributes them on multiple execution nodes. This allows the simulation to tolerate crash-failures of computing nodes. Moreover, FT-GAIA offers some protection against Byzantine failures, since interaction…
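The replication idea described in the abstract can be illustrated with a minimal Python sketch. It is not FT-GAIA's actual API: the function names (`replicas_needed`, `place_replicas`, `accept_message`), the round-robin placement, and the voting rule are all assumptions made for illustration. The sketch relies on the standard replication bounds (f + 1 copies to survive f crash failures, 2f + 1 copies to outvote f Byzantine replicas), which the excerpt above does not state explicitly.

```python
from collections import Counter


def replicas_needed(f, byzantine=False):
    """Standard bound (assumed here): f+1 copies for crash faults, 2f+1 for Byzantine."""
    return 2 * f + 1 if byzantine else f + 1


def place_replicas(entity_ids, nodes, m):
    """Hypothetical placement: put the m replicas of each entity on m distinct nodes."""
    if m > len(nodes):
        raise ValueError("need at least as many nodes as replicas per entity")
    placement = {}
    for i, entity in enumerate(entity_ids):
        # Round-robin with an offset so replicas of one entity never share a node.
        placement[entity] = [nodes[(i + k) % len(nodes)] for k in range(m)]
    return placement


def accept_message(copies, f, byzantine=False):
    """Decide which payload to deliver from the copies sent by a sender's replicas.

    Crash model: any single copy is trustworthy, so one copy suffices.
    Byzantine model: require f+1 identical copies, so at least one comes
    from a correct replica. Returns None when no payload qualifies yet.
    """
    if not copies:
        return None
    payload, count = Counter(copies).most_common(1)[0]
    threshold = f + 1 if byzantine else 1
    return payload if count >= threshold else None


if __name__ == "__main__":
    m = replicas_needed(1, byzantine=True)
    print(place_replicas(["e0", "e1"], ["n0", "n1", "n2"], m))
    print(accept_message(["move:3", "move:3", "move:9"], f=1, byzantine=True))
```

In this sketch a receiver filters replicated messages before handing one copy to the simulation model, which is one way the replication could stay transparent to the model code. Payloads must be hashable (strings or tuples) for the vote to work.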
