RetryGuard: Preventing Self-Inflicted Retry Storms in Cloud Microservices Applications
Jhonatan Tavori, Anat Bremler-Barr, Hanoch Levy, Ofek Lavi

TL;DR
RetryGuard is a distributed framework designed to prevent retry storms in cloud microservices, reducing resource usage and operational costs by managing retry policies based on an analytic model.
Contribution
The paper introduces RetryGuard, a novel distributed control system that manages retry patterns across microservices to prevent retry storms and optimize resource utilization.
Findings
Significantly reduces resource usage and costs compared to AWS default policies.
Demonstrates scalability and performance improvements in Kubernetes with Istio.
Effectively prevents retry storms and mitigates resource contention.
Abstract
Modern cloud applications are built on independent, diverse microservices, offering scalability, flexibility, and usage-based billing. However, the structural design of these varied services, along with their reliance on auto-scalers for dynamic internet traffic, introduces significant coordination challenges. As we demonstrate in this paper, common default retry patterns used between misaligned services can turn into retry storms which drive up resource usage and costs, leading to self-inflicted Denial-of-Wallet (DoW) scenarios. To overcome these problems we introduce RetryGuard, a distributed framework for productive control of retry patterns across interdependent microservices. By managing retry policy on a per-service basis and making parallel decisions, RetryGuard prevents retry storms, curbs resource contention, and mitigates escalating operational costs. RetryGuard makes its…
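To make the retry-storm mechanism concrete, here is a minimal sketch of worst-case retry amplification along a chain of services. It is an illustration only and is not the analytic model used by RetryGuard. It assumes a linear call chain in which each caller retries a failed downstream call a fixed number of times and the deepest service fails every attempt. The function names and the example retry budgets are hypothetical.

```python
from typing import List


def per_layer_attempts(retries_per_hop: List[int]) -> List[int]:
    """Worst-case attempts seen at each layer for ONE user request.

    retries_per_hop[i] is the number of retries (beyond the first try)
    that the caller at hop i issues when its downstream call fails.
    Assumption: every attempt fails, so every retry budget is exhausted.
    """
    attempts = 1
    seen = []
    for retries in retries_per_hop:
        if retries < 0:
            raise ValueError("retry counts must be non-negative")
        # Each attempt arriving at this hop fans out into (1 + retries) calls.
        attempts *= 1 + retries
        seen.append(attempts)
    return seen


def worst_case_attempts(retries: int, depth: int) -> int:
    """Closed form for a uniform policy: (1 + retries) ** depth."""
    if retries < 0 or depth < 0:
        raise ValueError("retries and depth must be non-negative")
    return (1 + retries) ** depth


if __name__ == "__main__":
    # Uniform default of 2 retries on each of 3 hops.
    print(per_layer_attempts([2, 2, 2]))
    # Hypothetical per-service policy: retry only at the first hop.
    print(per_layer_attempts([2, 0, 0]))
```

Under these assumptions, a uniform budget of 2 retries on each of 3 hops lets one user request reach the deepest service up to 27 times, while retrying only at the first hop caps it at 3. This multiplicative growth is the kind of effect that a per-service retry policy aims to limit.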
Taxonomy
Topics: Software System Performance and Reliability · Cloud Computing and Resource Management · Software-Defined Networks and 5G
