BALLAST: Bandit-Assisted Learning for Latency-Aware Stable Timeouts in Raft
Qizhi Wang

TL;DR
BALLAST introduces a bandit-based adaptive timeout mechanism for Raft, improving stability and recovery times under challenging network conditions by replacing static heuristics with contextual learning.
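For context, the "static heuristic" being replaced is Raft's randomized election timeout: each follower draws a fresh timeout uniformly from a fixed window and starts an election when it expires. The sketch below is a hypothetical illustration, not BALLAST code. The 150 to 300 ms window is the example range from the original Raft paper, and `split_vote_likely` is a deliberately simplified model of why the heuristic becomes brittle when message delay is large relative to the window.

```python
import random

# Hypothetical parameters: the example window from the original Raft paper.
# Real deployments tune this to their network.
MIN_TIMEOUT_MS = 150.0
MAX_TIMEOUT_MS = 300.0


def draw_election_timeout(rng: random.Random) -> float:
    """Static heuristic: draw a fresh timeout uniformly from [MIN, MAX]."""
    return rng.uniform(MIN_TIMEOUT_MS, MAX_TIMEOUT_MS)


def split_vote_likely(timeouts_ms: list[float], one_way_delay_ms: float) -> bool:
    """Simplified model (needs at least two timeouts).

    If the two earliest timers fire less than one message delay apart, the
    second follower becomes a candidate before the first candidate's
    RequestVote reaches it, so the votes may split between them.
    """
    first, second = sorted(timeouts_ms)[:2]
    return second - first < one_way_delay_ms
```

Because the window is fixed, long-tail delays can easily exceed the gap between the two earliest timers, which is the repeated-split-vote failure mode that an adaptive timeout aims to avoid.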
Contribution
The paper proposes BALLAST, an online adaptation method that uses linear contextual bandits (LinUCB variants) to choose Raft election timeouts from a discrete set of candidate values during network instability.
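To make "linear contextual bandits over timeout arms" concrete, here is a minimal sketch of the standard disjoint LinUCB algorithm (one ridge-regression model per arm), assuming each arm is a candidate timeout in milliseconds and the context is a short numeric feature vector. The class name, the feature choice, and the reward signal are hypothetical illustrations, not the paper's exact variant. The per-arm inverse matrix is updated with the Sherman-Morrison formula, so no explicit matrix inversion is needed.

```python
import math


class LinUCB:
    """Disjoint LinUCB: one ridge-regression model per arm.

    Hypothetical setup: arms are candidate election timeouts (ms), and
    contexts are feature vectors such as [bias, normalized tail latency].
    """

    def __init__(self, arms_ms, dim, alpha=1.0):
        self.arms_ms = list(arms_ms)
        self.dim = dim
        self.alpha = alpha  # width of the exploration bonus
        # Per-arm inverse design matrix A^-1 (identity = ridge penalty of 1).
        self.A_inv = [[[1.0 if i == j else 0.0 for j in range(dim)]
                       for i in range(dim)] for _ in self.arms_ms]
        # Per-arm accumulated reward-weighted contexts b = sum(r * x).
        self.b = [[0.0] * dim for _ in self.arms_ms]

    @staticmethod
    def _matvec(M, x):
        return [sum(m * xi for m, xi in zip(row, x)) for row in M]

    def _ucb(self, a, x):
        Ax = self._matvec(self.A_inv[a], x)
        theta = self._matvec(self.A_inv[a], self.b[a])  # ridge estimate
        mean = sum(t * xi for t, xi in zip(theta, x))
        width = math.sqrt(max(0.0, sum(xi * v for xi, v in zip(x, Ax))))
        return mean + self.alpha * width

    def select(self, x):
        """Return the index of the arm with the highest upper confidence bound."""
        scores = [self._ucb(a, x) for a in range(len(self.arms_ms))]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, a, x, reward):
        """Sherman-Morrison rank-one update of arm a after observing reward."""
        Ax = self._matvec(self.A_inv[a], x)  # A^-1 is symmetric
        denom = 1.0 + sum(xi * v for xi, v in zip(x, Ax))
        M = self.A_inv[a]
        for i in range(self.dim):
            for j in range(self.dim):
                M[i][j] -= Ax[i] * Ax[j] / denom
        self.b[a] = [bi + reward * xi for bi, xi in zip(self.b[a], x)]
```

In a timeout controller, the reward could be something like the negative recovery time observed after applying the chosen timeout (an assumption here). The linear model lets one set of observations generalize across similar network conditions instead of learning each condition from scratch.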
Findings
Reduces recovery time in WAN scenarios
Decreases unwritable time during network turbulence
Performs competitively on stable networks
Abstract
Randomized election timeouts are a simple and effective liveness heuristic for Raft, but they become brittle under long-tail latency, jitter, and partition recovery, where repeated split votes can inflate unavailability. This paper presents BALLAST, a lightweight online adaptation mechanism that replaces static timeout heuristics with contextual bandits. BALLAST selects from a discrete set of timeout "arms" using efficient linear contextual bandits (LinUCB variants), and augments learning with safe exploration to cap risk during unstable periods. We evaluate BALLAST on a reproducible discrete-event simulation with long-tail delay, loss, correlated bursts, node heterogeneity, and partition/recovery turbulence. Across challenging WAN regimes, BALLAST substantially reduces recovery time and unwritable time compared to standard randomized timeouts and common heuristics, while remaining…
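The abstract does not spell out the safe-exploration rule, so the following is only one plausible form, written as a hypothetical sketch. During stable periods it picks the arm with the highest upper confidence bound (optimistic exploration). When the caller flags the network as unstable, it restricts the choice to a vetted set of arms and switches to the lower confidence bound, so that an untested timeout is not tried while the cluster may be leaderless. The `ArmEstimate` type, the `safe_ids` set, and the `unstable` flag are all assumptions; in practice the flag might come from signals such as recent failed elections.

```python
from dataclasses import dataclass


@dataclass
class ArmEstimate:
    timeout_ms: float
    mean: float   # estimated reward for this arm in the current context
    width: float  # confidence width, e.g. a LinUCB exploration term


def choose_arm(arms: list[ArmEstimate], unstable: bool, safe_ids: set[int]) -> int:
    """Hypothetical safe-exploration rule for picking a timeout arm."""
    if not unstable:
        # Stable network: optimism under uncertainty (upper confidence bound).
        return max(range(len(arms)), key=lambda i: arms[i].mean + arms[i].width)
    # Unstable network: only vetted arms, scored pessimistically (lower bound).
    candidates = [i for i in sorted(safe_ids) if 0 <= i < len(arms)]
    if not candidates:
        raise ValueError("safe_ids must contain at least one valid arm index")
    return max(candidates, key=lambda i: arms[i].mean - arms[i].width)
```

Switching from an upper to a lower confidence bound is a common way to trade exploration for predictability. Restricting to `safe_ids` additionally bounds the worst-case timeout that can be chosen while the system is already degraded.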
