Co-evolving Tracing and Fault Injection with Box of Pain
Daniel Bittman, Ethan L. Miller, Peter Alvaro

TL;DR
Box of Pain is a lightweight tool that combines tracing and fault injection at the system call level to better understand and test distributed systems' robustness against faults.
Contribution
Box of Pain interposes at the system call level to reconstruct causal relationships among communication events and to simulate the effects of partial failures in unmodified distributed systems.
Findings
Effective reconstruction of communication event order
Demonstrated ability to simulate partial failures
Lightweight approach suitable for real systems
Abstract
Distributed systems are hard to reason about largely because of uncertainty about what may go wrong in a particular execution, and about whether the system will mitigate those faults. Tools that perturb executions can help test whether a system is robust to faults, while tools that observe executions can help better understand their system-wide effects. We present Box of Pain, a tracer and fault injector for unmodified distributed systems that addresses both concerns by interposing at the system call level and dynamically reconstructing the partial order of communication events based on causal relationships. Box of Pain's lightweight approach to tracing and focus on simulating the effects of partial failures on communication rather than the failures themselves sets it apart from other tracing and fault injection systems. We present evidence of the promise of Box of Pain and its approach…
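To make the idea of reconstructing a partial order of communication events concrete, here is a minimal sketch in Python. It is not Box of Pain's implementation; the `Event` fields, the message tags, and the helper names are hypothetical and chosen only for illustration. The sketch assumes each traced process yields an ordered list of send and receive events, and that a send can be matched to its receive by a shared tag. Under those assumptions, the happens-before relation is the transitive closure of per-process program order plus send-to-receive edges.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    pid: int   # hypothetical process identifier
    seq: int   # position of the event in that process's local trace
    kind: str  # "send" or "recv"
    msg: str   # hypothetical tag that pairs a send with its receive


def build_edges(events):
    """Return direct causal edges: program order plus send -> recv."""
    edges = defaultdict(set)
    by_pid = defaultdict(list)
    sends = {}
    for e in events:
        by_pid[e.pid].append(e)
        if e.kind == "send":
            sends[e.msg] = e
    # Program order: consecutive events in the same process.
    for trace in by_pid.values():
        trace.sort(key=lambda e: e.seq)
        for earlier, later in zip(trace, trace[1:]):
            edges[earlier].add(later)
    # Communication order: a send precedes its matching receive.
    for e in events:
        if e.kind == "recv" and e.msg in sends:
            edges[sends[e.msg]].add(e)
    return edges


def happens_before(edges, a, b):
    """True if b is reachable from a along causal edges."""
    stack, seen = [a], set()
    while stack:
        cur = stack.pop()
        for nxt in edges.get(cur, ()):
            if nxt == b:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def concurrent(edges, a, b):
    """Two distinct events are concurrent if neither precedes the other."""
    return a != b and not happens_before(edges, a, b) and not happens_before(edges, b, a)


# Three processes: 1 sends m1 to 2, 2 forwards m2 to 3, 1 sends an unreceived m3.
a1 = Event(1, 0, "send", "m1")
a2 = Event(1, 1, "send", "m3")
b1 = Event(2, 0, "recv", "m1")
b2 = Event(2, 1, "send", "m2")
c1 = Event(3, 0, "recv", "m2")
edges = build_edges([a1, a2, b1, b2, c1])
```

In this example `a1` causally precedes `c1` through the chain of messages, while `a2` and `c1` are unordered. A partial order, rather than a single global timeline, is what lets a tracer treat such unordered events as interchangeable across runs.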
Taxonomy
Topics: Software System Performance and Reliability · Distributed systems and fault tolerance · Scientific Computing and Data Management
