Real Life Is Uncertain. Consensus Should Be Too!

Reginald Frank; Soujanya Ponnapalli; Octavio Lomeli; Neil Giridharan; Marcos K Aguilera; and Natacha Crooks

arXiv:2602.11362·cs.DC·February 13, 2026

Open Access

TL;DR

This paper advocates replacing the fixed-threshold failure model used by distributed consensus protocols with a probabilistic one that better reflects the complexity of real-world faults, with the aim of enabling more reliable and efficient systems.

Contribution

The paper proposes a probabilistic failure model for consensus protocols, so that protocols can be optimized around the failure behavior of individual machines rather than a fixed failure threshold.

Findings

01 Probabilistic models better capture real-world fault behaviors.

02 Potential to improve system reliability and efficiency.

03 Enables bypassing traditional quorum intersection bottlenecks.

Abstract

Modern distributed systems rely on consensus protocols to build a fault-tolerant core upon which they can build applications. Consensus protocols are correct under a specific failure model, where up to $f$ machines can fail. We argue that this $f$-threshold failure model oversimplifies the real world and limits potential opportunities to optimize for cost or performance. We argue instead for a probabilistic failure model that captures the complex and nuanced nature of faults observed in practice. Probabilistic consensus protocols can explicitly leverage individual machine failure curves and explore side-stepping traditional bottlenecks such as majority quorum intersection, enabling systems that are more reliable, efficient, cost-effective, and sustainable.
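To make the contrast between the two failure models concrete, the following is a minimal sketch and not the paper's method. It assumes each machine fails independently with its own probability over some fixed time window (a single point read off a hypothetical failure curve) and computes how likely it is that more than $f$ machines fail, which is the event an $f$-threshold protocol does not cover. The function names `failure_count_distribution` and `prob_threshold_exceeded` are illustrative, and the independence assumption is a simplification that correlated real-world faults may violate.

```python
from typing import List, Sequence


def failure_count_distribution(p_fail: Sequence[float]) -> List[float]:
    """Return dist where dist[k] is the probability that exactly k machines fail.

    Assumption (not from the paper): machines fail independently, and
    p_fail[i] is machine i's failure probability over the window of interest.
    """
    dist = [1.0]  # with zero machines, zero failures is certain
    for p in p_fail:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)   # this machine survives
            nxt[k + 1] += mass * p       # this machine fails
        dist = nxt
    return dist


def prob_threshold_exceeded(p_fail: Sequence[float], f: int) -> float:
    """Probability that more than f machines fail, i.e. the f-threshold
    assumption is violated, under the independence assumption above."""
    dist = failure_count_distribution(p_fail)
    return sum(dist[f + 1:])


if __name__ == "__main__":
    # Two hypothetical 3-machine clusters, both tolerating f = 1 failure.
    uniform = [0.01, 0.01, 0.01]
    mixed = [0.001, 0.001, 0.05]
    print(prob_threshold_exceeded(uniform, f=1))
    print(prob_threshold_exceeded(mixed, f=1))
```

Under the $f$-threshold model both clusters look identical (three machines, $f = 1$), yet the chance of exceeding the threshold differs between them. That difference is the kind of information a probabilistic failure model could expose to a protocol.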

Taxonomy

Topics: Distributed systems and fault tolerance · Software System Performance and Reliability · Cloud Computing and Resource Management