ANSC: Probabilistic Capacity Health Scoring for Datacenter-Scale Reliability
Madhava Gaikwad, Abhishek Gandhi

TL;DR
ANSC introduces a probabilistic scoring framework for datacenter capacity health, enabling operators to proactively identify and prioritize imminent capacity risks across large-scale infrastructures.
Contribution
The paper presents a probabilistic capacity health scoring system that weighs both current residual capacity and the probability of additional failures, improving risk assessment in hyperscale datacenters.
Findings
Enables prioritization of remediation across 400+ datacenters
Reduces alert noise and false positives
Focuses SRE attention on the most critical capacity risks
Abstract
We present ANSC, a probabilistic capacity health scoring framework for hyperscale datacenter fabrics. While existing alerting systems detect individual device or link failures, they do not capture the aggregate risk of cascading capacity shortfalls. ANSC provides a color-coded scoring system that indicates the urgency of issues not solely by current impact, but by the probability of imminent capacity violations. Our system accounts for both current residual capacity and the probability of additional failures, normalized at the datacenter and regional levels. We demonstrate that ANSC enables operators to prioritize remediation across more than 400 datacenters and 60 regions, reducing noise and aligning SRE focus on the most critical risks.
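The abstract does not give ANSC's scoring formula, so the following is only a minimal illustrative sketch of the general idea: combine current residual capacity with the probability of further failures to estimate the chance of a capacity violation, then map that chance to a color. Everything specific here is an assumption and not taken from the paper: the function names, independent link failures, integer link capacities, and the amber/red thresholds. Normalization at the datacenter and regional levels is not modeled.

```python
from typing import Dict, List


def violation_probability(capacities: List[int],
                          fail_probs: List[float],
                          demand: int) -> float:
    """Return P(surviving capacity < demand).

    Assumption (not from the paper): links fail independently.
    A link that is already down can be modeled with fail_prob = 1.0,
    so current residual capacity and future failures share one model.
    """
    # dist maps total surviving capacity -> probability of that total.
    dist: Dict[int, float] = {0: 1.0}
    for cap, p in zip(capacities, fail_probs):
        nxt: Dict[int, float] = {}
        for total, prob in dist.items():
            # Case 1: this link fails and contributes no capacity.
            nxt[total] = nxt.get(total, 0.0) + prob * p
            # Case 2: this link survives and contributes its capacity.
            nxt[total + cap] = nxt.get(total + cap, 0.0) + prob * (1.0 - p)
        dist = nxt
    return sum(prob for total, prob in dist.items() if total < demand)


def health_color(prob: float, amber: float = 0.01, red: float = 0.10) -> str:
    """Map a violation probability to a color (thresholds are hypothetical)."""
    if prob >= red:
        return "red"
    if prob >= amber:
        return "amber"
    return "green"


# Four 100-unit links, each with a 10% failure probability.
caps = [100, 100, 100, 100]
probs = [0.1, 0.1, 0.1, 0.1]
risk = violation_probability(caps, probs, demand=300)
print(health_color(risk))
```

In this sketch a fabric with no current impact can still score amber or red if losing one or two more links would push capacity below demand, which is the distinction the abstract draws between current impact and imminent risk.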
