A Practical Framework for Flaky Failure Triage in Distributed Database Continuous Integration
Jun-Peng Zhu, Qizhi Wang, Yulong Zhai, Yishen Sun, Sen Chen, Kai Xu, Peng Cai, Hongming Zhang, Heng Long, Liu Tang, Qi Liu

TL;DR
SCOUT is a practical, state-aware framework for triaging flaky failures in distributed database CI. It restricts itself to strict-causal features and applies calibration to improve decision accuracy under telemetry and workload shifts.
Contribution
The paper introduces SCOUT, a novel online triage framework that uses causal features, calibration, and posterior correction for reliable flaky failure detection in distributed database CI.
Findings
SCOUT achieves high accuracy in flaky failure detection.
SCOUT operates with low latency (1.17 ms P95) in production.
SCOUT is effective on diverse datasets, including TiDB and GitHub Actions.
Abstract
Flaky failure triage is crucial for keeping distributed database continuous integration (CI) efficient and reliable. After a failure is observed, operators must quickly decide whether to auto-rerun the job as likely flaky or escalate it as likely persistent, often under CPU-only millisecond budgets. Existing approaches remain difficult to deploy in this setting because they may rely on post-failure artifacts, produce poorly calibrated scores under telemetry and workload shifts, or learn from labels generated by finite rerun policies. To address these challenges, we present SCOUT, a practical state-aware causal online uncertainty-calibrated triage framework for distributed database CI. SCOUT uses only strict-causal features, including pre-failure telemetry and strictly historical data, to make online decisions without lookahead. Specifically, SCOUT combines lightweight state-aware…
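As a rough illustration of the two ideas named in the abstract (strict-causal inputs and a rerun-or-escalate decision), the following Python sketch filters telemetry to samples observed strictly before the failure and maps a calibrated flaky probability to an action. It is not SCOUT's implementation: the `Sample` type, the function names, and the 0.8 threshold are hypothetical choices made only for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """One telemetry reading (hypothetical schema for this sketch)."""
    ts: float     # observation time, in seconds
    value: float  # metric value


def strict_causal(samples: List[Sample], failure_ts: float) -> List[Sample]:
    """Keep only samples observed strictly before the failure (no lookahead)."""
    return [s for s in samples if s.ts < failure_ts]


def triage(p_flaky: float, threshold: float = 0.8) -> str:
    """Map a calibrated flaky probability to an action.

    The threshold is an assumed value, not one reported by the paper.
    """
    if not 0.0 <= p_flaky <= 1.0:
        raise ValueError("p_flaky must be a probability in [0, 1]")
    return "rerun" if p_flaky >= threshold else "escalate"


if __name__ == "__main__":
    telemetry = [Sample(1.0, 0.2), Sample(5.0, 0.9), Sample(7.0, 0.4)]
    # The sample at ts=5.0 coincides with the failure and is excluded,
    # as is the post-failure sample at ts=7.0.
    print(strict_causal(telemetry, failure_ts=5.0))
    print(triage(0.9), triage(0.5))
```

The strict inequality in `strict_causal` is the point of the sketch: anything recorded at or after the failure timestamp is treated as a post-failure artifact and never reaches the decision.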
Taxonomy
Topics: Software System Performance and Reliability · Distributed systems and fault tolerance · Advanced Database Systems and Queries
