Optimizing Waiting Thresholds Within A State Machine
Rohit Pandey, Yifan Chang, Cameron White, Gaurav Jagtiani, Aerin Young, Kim, Gil Lapid Shafriri, Sathya Singh

TL;DR
This paper develops methods to optimize node recovery thresholds in cloud data centers, minimizing downtime by fitting probabilistic models and using numerical optimization for complex multi-threshold scenarios.
Contribution
It introduces a regression-based approach for customizing recovery thresholds based on node features and extends the optimization to multiple thresholds using numerical techniques.
Findings
Heavy-tail distributions effectively model node recovery times.
Node-specific features improve threshold optimization accuracy.
Gradient descent successfully optimizes multiple intertwined thresholds.
Abstract
Azure (the cloud service provided by Microsoft) is composed of physical computing units called nodes. These nodes are controlled by a software component called the Fabric Controller (FC), which can consider a node to be in one of many different states such as Ready, Unhealthy, Booting, etc. Some of these states correspond to a node being unresponsive to the FC's requests. When a node stays unresponsive for longer than a set threshold, the FC intervenes and reboots the node. We minimize the downtime associated with the intervention threshold that applies when a node switches to the Unhealthy state by fitting various heavy-tail probability distributions. We consider using features of the node to customize the organic recovery model to the individual nodes that go unhealthy. This regression approach allows us to use information about the node like hardware, software versions, historical performance indicators,…
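The trade-off described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the organic recovery time T is modeled with a Lomax (Pareto type II) distribution as one example of a heavy-tail family, a reboot is assumed to add a fixed downtime reboot_cost, and the parameter values are invented. Under those assumptions, the expected downtime for a threshold tau is E[min(T, tau)] + P(T > tau) * reboot_cost, and the sketch compares a grid search against the closed-form minimizer alpha * reboot_cost - lam that follows from setting the derivative to zero.

```python
# Illustrative sketch only: the Lomax model, the fixed reboot cost, and all
# parameter values are assumptions, not the fitted model from the paper.

def survival(t, alpha, lam):
    """P(T > t) for a Lomax (Pareto type II) organic recovery time T."""
    return (1.0 + t / lam) ** (-alpha)


def expected_downtime(tau, alpha, lam, reboot_cost):
    """Expected downtime if the controller reboots after waiting tau seconds.

    E[min(T, tau)] equals the integral of the survival function from 0 to tau,
    which has the closed form below for alpha != 1. If the node has not
    recovered by tau (probability survival(tau)), the reboot adds reboot_cost.
    """
    waited = lam / (alpha - 1.0) * (1.0 - (1.0 + tau / lam) ** (1.0 - alpha))
    return waited + survival(tau, alpha, lam) * reboot_cost


def best_threshold(alpha, lam, reboot_cost, tau_max=3000, step=1):
    """Grid search over integer thresholds in [0, tau_max]."""
    candidates = range(0, tau_max + 1, step)
    return min(
        candidates,
        key=lambda tau: expected_downtime(tau, alpha, lam, reboot_cost),
    )


# Hypothetical parameters: shape, scale (seconds), reboot downtime (seconds).
alpha, lam, reboot_cost = 2.0, 60.0, 600.0

tau_grid = best_threshold(alpha, lam, reboot_cost)
# The derivative S(tau) * (1 - hazard(tau) * reboot_cost) vanishes where the
# hazard alpha / (lam + tau) equals 1 / reboot_cost.
tau_closed = max(0.0, alpha * reboot_cost - lam)

print(tau_grid, tau_closed)
```

With these made-up values the closed form gives 2 * 600 - 60 = 1140 seconds. The intuition is that a heavy-tail model has a decreasing hazard rate: early on, a node is likely to recover on its own soon, so waiting is worthwhile; once the hazard drops below 1 / reboot_cost, each further second of waiting costs more than it is expected to save.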
Taxonomy
Topics: Cloud Computing and Resource Management · Software System Performance and Reliability · Distributed systems and fault tolerance
Methods: Affine Coupling · Normalizing Flows
