Cast: Automated Resilience Testing for Production Cloud Service Systems
Zhuangbin Chen, Zhiling Deng, Kaiming Zhang, Yang Liu, Cheng Cui, Jinfeng Zhong, Zibin Zheng

TL;DR
Cast is an automated framework that tests microservice resilience in production by replaying real traffic and injecting faults, effectively identifying vulnerabilities and improving system reliability.
Contribution
The paper introduces a comprehensive, automated resilience testing framework for production microservices that combines traffic replay, fault injection, and complexity-driven test prioritization.
Findings
Identified 137 potential vulnerabilities in large-scale applications.
Confirmed 89 vulnerabilities through developer validation.
Achieved 90% coverage on a benchmark set of bugs.
Abstract
The distributed nature of microservice architecture introduces significant resilience challenges. Traditional testing methods, limited by extensive manual effort and oversimplified test environments, fail to capture production system complexity. To address these limitations, we present Cast, an automated, end-to-end framework for microservice resilience testing in production. It achieves high test fidelity by replaying production traffic against a comprehensive library of application-level faults to exercise internal error-handling logic. To manage the combinatorial test space, Cast employs a complexity-driven strategy to systematically prune redundant tests and prioritize high-value tests targeting the most critical service execution paths. Cast automates the testing lifecycle through a three-phase pipeline (i.e., startup, fault injection, and recovery) and uses a multi-faceted oracle…
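The pruning and prioritization step described in the abstract can be illustrated with a minimal sketch. This is not Cast's actual algorithm: the abstract does not define its complexity metric, so the sketch assumes the number of distinct services on an execution path as a stand-in. The names `TestCase`, `complexity`, and `plan`, and the fault labels, are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TestCase:
    path: Tuple[str, ...]  # ordered services a replayed request traverses
    target: str            # service where the fault would be injected
    fault: str             # hypothetical fault label, e.g. "timeout"


def complexity(path: Tuple[str, ...]) -> int:
    # Assumed proxy for path complexity: count of distinct services.
    return len(set(path))


def plan(candidates: List[TestCase]) -> List[TestCase]:
    """Drop redundant tests, then order the rest by descending path complexity."""
    seen = set()
    kept = []
    for tc in candidates:
        # A fault on a service the traffic never reaches cannot be exercised.
        if tc.target not in tc.path:
            continue
        # Identical (path, target, fault) combinations are redundant.
        if tc in seen:
            continue
        seen.add(tc)
        kept.append(tc)
    # sorted() is stable, so ties keep their original relative order.
    return sorted(kept, key=lambda tc: complexity(tc.path), reverse=True)


if __name__ == "__main__":
    candidates = [
        TestCase(("gateway", "cart"), "cart", "timeout"),
        TestCase(("gateway", "cart", "payment", "ledger"), "payment", "exception"),
        TestCase(("gateway", "cart"), "cart", "timeout"),    # duplicate
        TestCase(("gateway", "cart"), "ledger", "timeout"),  # target not on path
    ]
    for tc in plan(candidates):
        print(complexity(tc.path), tc.target, tc.fault)
```

Under these assumptions, the duplicate and the unreachable test are pruned, and the test on the longer path is scheduled first. A real system would likely weigh more than path length, such as call-graph fan-out or how critical each service is.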
Taxonomy
Topics: Software System Performance and Reliability · Software Testing and Debugging Techniques · Software-Defined Networks and 5G
