Proactive Service Migration for Long-Running Byzantine Fault Tolerant Systems

Wenbing Zhao

arXiv:0803.1521·cs.DC·March 12, 2008

TL;DR

This paper introduces a migration-based proactive recovery scheme for long-running Byzantine fault tolerant systems that reduces vulnerability windows and improves system availability by eliminating reboot delays and coordinating recovery among replicas.

Contribution

It presents a novel proactive recovery method based on service migration that automatically adapts to system load and enhances fault tolerance without reboot delays.

Findings

1. Reduces the vulnerability window in Byzantine fault tolerant systems.
2. Improves system availability in the presence of faults.
3. Automatically adjusts to system load to prevent excessive concurrent recoveries.

Abstract

In this paper, we describe a novel proactive recovery scheme based on service migration for long-running Byzantine fault tolerant systems. Proactive recovery is an essential method for ensuring the long-term reliability of fault tolerant systems that are under continuous threat from malicious adversaries. The primary benefit of our proactive recovery scheme is a reduced vulnerability window. This is achieved by removing the time-consuming reboot step from the critical path of proactive recovery. Because our migration-based proactive recovery is coordinated among the replicas, it can automatically adjust to different system loads and avoid the problem of excessive concurrent proactive recoveries that may occur in previous work with fixed watchdog timeouts. Moreover, the fast proactive recovery also significantly improves system availability in the presence of faults.
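
The two claims in the abstract (a shorter vulnerability window once the reboot leaves the critical path, and the danger of too many replicas recovering at once) can be illustrated with a rough back-of-the-envelope model. This is only a sketch and not the paper's protocol: it assumes the standard Byzantine fault tolerance setting of n = 3f + 1 replicas with quorums of 2f + 1, and it models the window simply as the recovery interval plus whatever steps sit on the critical path. All function names and the timing figures are hypothetical.

```python
def replicas_needed(f: int) -> int:
    """Replicas required to tolerate f Byzantine faults (standard 3f + 1 bound)."""
    return 3 * f + 1


def can_make_progress(f: int, recovering: int) -> bool:
    """True if enough replicas remain in service to form a 2f + 1 quorum.

    Assumption: a replica undergoing recovery cannot take part in agreement.
    """
    n = replicas_needed(f)
    return n - recovering >= 2 * f + 1


def vulnerability_window(recovery_interval: float, critical_path: list[float]) -> float:
    """Simplified model: time between recovery rounds plus critical-path steps."""
    return recovery_interval + sum(critical_path)


# Hypothetical timings in seconds, chosen only for illustration.
INTERVAL = 600.0
REBOOT = 60.0
STATE_TRANSFER = 5.0

# Reboot-based recovery: the reboot is on the critical path.
reboot_based = vulnerability_window(INTERVAL, [REBOOT, STATE_TRANSFER])

# Migration-based recovery: the service moves to a standby node, so only
# the state transfer is on the critical path in this model.
migration_based = vulnerability_window(INTERVAL, [STATE_TRANSFER])

print(reboot_based, migration_based)
print(can_make_progress(f=1, recovering=1), can_make_progress(f=1, recovering=2))
```

In this model, with f = 1 a single recovering replica still leaves a quorum, while two simultaneous recoveries do not. That is the situation uncoordinated fixed timeouts can run into and coordinated recovery is meant to avoid.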

Taxonomy

Topics: Distributed systems and fault tolerance · Real-Time Systems Scheduling · Software System Performance and Reliability