Optimizing Waiting Thresholds Within A State Machine

Rohit Pandey; Yifan Chang; Cameron White; Gaurav Jagtiani; Aerin Young Kim; Gil Lapid Shafriri; Sathya Singh

arXiv:1810.03278·cs.LG·October 9, 2018

Open Access

TL;DR

This paper develops methods to optimize node recovery thresholds in cloud data centers, minimizing downtime by fitting probabilistic models and using numerical optimization for complex multi-threshold scenarios.

Contribution

It introduces a regression-based approach for customizing recovery thresholds based on node features and extends the optimization to multiple thresholds using numerical techniques.

Findings

01 Heavy-tail distributions effectively model node recovery times.

02 Node-specific features improve threshold optimization accuracy.

03 Gradient descent successfully optimizes multiple intertwined thresholds.

Abstract

Azure (the cloud service provided by Microsoft) is composed of physical computing units called nodes. These nodes are controlled by a software component called the Fabric Controller (FC), which can consider a node to be in one of many different states such as Ready, Unhealthy, Booting, etc. Some of these states correspond to a node being unresponsive to the FC's requests. When a node stays unresponsive for longer than a set threshold, the FC intervenes and reboots the node. We minimized the downtime caused by the intervention threshold when a node switches to the Unhealthy state by fitting various heavy-tail probability distributions. We consider using features of the node to customize the organic recovery model to the individual nodes that go unhealthy. This regression approach allows us to use information about the node like hardware, software versions, historical performance indicators,…
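The trade-off the abstract describes can be illustrated with a small sketch. It is not the authors' model: it assumes organic recovery time T follows a Lomax (Pareto type II) distribution with hypothetical shape `alpha` and scale `lam`, and that a forced reboot adds a fixed hypothetical cost `reboot_cost` on top of the time already waited. Under those assumptions the expected downtime for a threshold tau is E[min(T, tau)] + reboot_cost * P(T > tau), and a grid search can locate the minimizing threshold.

```python
# Illustrative only: the Lomax distribution, parameter values, and the fixed
# reboot cost are assumptions, not values or choices taken from the paper.

def survival(t, alpha, lam):
    """P(T > t) for a Lomax distribution with shape alpha and scale lam."""
    return (1.0 + t / lam) ** (-alpha)

def expected_downtime(tau, alpha, lam, reboot_cost):
    """Expected downtime when the controller reboots after waiting tau.

    E[min(T, tau)] is the integral of the survival function from 0 to tau,
    which has a closed form for the Lomax distribution when alpha != 1.
    """
    waited = lam / (alpha - 1.0) * (1.0 - (1.0 + tau / lam) ** (1.0 - alpha))
    return waited + reboot_cost * survival(tau, alpha, lam)

def best_threshold(alpha, lam, reboot_cost, tau_max=5000, step=1):
    """Grid search over integer thresholds in [0, tau_max]."""
    grid = range(0, tau_max + 1, step)
    return min(grid, key=lambda tau: expected_downtime(tau, alpha, lam, reboot_cost))

# Hypothetical parameters, in seconds.
ALPHA, LAM, REBOOT = 2.0, 60.0, 600.0

tau_grid = best_threshold(ALPHA, LAM, REBOOT)
# Setting the derivative to zero gives hazard(tau) = 1 / reboot_cost,
# i.e. tau = alpha * reboot_cost - lam for this particular model.
tau_closed = ALPHA * REBOOT - LAM

print(tau_grid, tau_closed)
```

In this toy model the Lomax hazard rate alpha / (lam + tau) decreases with time, so waiting pays off only while the chance of imminent organic recovery still exceeds 1 / reboot_cost; the grid search and the closed-form expression should agree to within one grid step.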

Taxonomy

Topics: Cloud Computing and Resource Management · Software System Performance and Reliability · Distributed systems and fault tolerance

Methods: Affine Coupling · Normalizing Flows