Fault Tolerance in Distributed Systems using Fused State Machines
Bharath Balasubramanian, Vijay K. Garg

TL;DR
This paper introduces fusion, a fault-tolerance technique for distributed systems modeled as deterministic finite state machines (DFSMs). To correct f crash faults among n machines, fusion needs f backup machines where replication needs nf.
Contribution
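It builds a framework for fault tolerance in DFSMs based on Hamming distances, defines the (f,m)-fusion, and gives an algorithm that generates an (f,f)-fusion: f backup machines that can correct f crash faults or f/2 Byzantine faults among a given set of machines.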
Findings
Average state space savings of 38% with fusion in the evaluation on the MCNC'91 DFSM benchmarks
Fusion reduces backup machines from nf to f
Application to MapReduce shows resource savings
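The backup counts above can be checked with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and rounding f/2 down to an integer is an assumption, since the summary states only "f/2".

```python
def replication_backups(n: int, f: int) -> int:
    """Backups used by replication: f extra copies of each of the n machines."""
    return n * f


def fusion_backups(n: int, f: int) -> int:
    """Backups used by an (f, f)-fusion: f machines in total, independent of n."""
    return f


def byzantine_faults_corrected(f: int) -> int:
    """Byzantine faults corrected by backups sized for f crash faults.

    Assumption: f/2 is rounded down when f is odd.
    """
    return f // 2


# Example: 10 machines, tolerate 2 crash faults.
n, f = 10, 2
saved = replication_backups(n, f) - fusion_backups(n, f)
```

For n = 10 and f = 2, replication uses 20 backup machines and fusion uses 2, so the gap widens linearly with n.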
Abstract
Replication is a standard technique for fault tolerance in distributed systems modeled as deterministic finite state machines (DFSMs or machines). To correct f crash or f/2 Byzantine faults among n different machines, replication requires nf additional backup machines. We present a solution called fusion that requires just f additional backup machines. First, we build a framework for fault tolerance in DFSMs based on the notion of Hamming distances. We introduce the concept of an (f,m)-fusion, which is a set of m backup machines that can correct f crash faults or f/2 Byzantine faults among a given set of machines. Second, we present an algorithm to generate an (f,f)-fusion for a given set of machines. We ensure that our backups are efficient in terms of the size of their state and event sets. Our evaluation of fusion on the widely used MCNC'91 benchmarks for DFSMs shows that the average…
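To make the idea of a single fused backup concrete, here is a toy sketch for f = 1. It assumes two mod-3 counters, one counting event "a" and one counting event "b", backed up by one machine that counts both events mod 3. The class and function names are hypothetical, and this is a hand-built example, not the paper's generation algorithm.

```python
class ModCounter:
    """A small DFSM: counts occurrences of selected events modulo `modulus`."""

    def __init__(self, events, modulus=3):
        self.events = set(events)
        self.modulus = modulus
        self.state = 0

    def apply(self, event):
        # Events outside this machine's event set leave its state unchanged.
        if event in self.events:
            self.state = (self.state + 1) % self.modulus


def run(machines, trace):
    """Feed every event of the trace to every machine."""
    for event in trace:
        for machine in machines:
            machine.apply(event)


def recover(fused_state, surviving_state, modulus=3):
    """Recover the crashed counter's state from the backup and the survivor."""
    return (fused_state - surviving_state) % modulus


counter_a = ModCounter({"a"})
counter_b = ModCounter({"b"})
fused = ModCounter({"a", "b"})  # single backup for both primaries

run([counter_a, counter_b, fused], ["a", "b", "a", "a", "b"])

# If counter_a crashes, its state follows from fused and counter_b.
recovered_a = recover(fused.state, counter_b.state)
# If counter_b crashes instead, it follows from fused and counter_a.
recovered_b = recover(fused.state, counter_a.state)
```

Replication would keep one copy of each counter (two backups, three states each) to survive one crash. Here a single three-state backup suffices, because the state of any two of the three machines determines the third.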
Taxonomy
Topics: Distributed systems and fault tolerance · Software System Performance and Reliability · Service-Oriented Architecture and Web Services
