Annual Interruption Rate as a KPI, its measurement and comparison
Rohit Pandey, Yingnong Dang, Ali Vira, Aerin Kim, Gil Lapid Shafriri, Murali Chintalapati

TL;DR
This paper explores the failure rate as a KPI, specifically the Annual Interruption Rate in Azure, and discusses methods for measuring, comparing, and statistically analyzing this metric to detect regressions and system changes.
Contribution
It introduces a comprehensive approach to measuring and comparing failure rates using statistical hypothesis testing, with practical applications in system monitoring and regression detection.
Findings
Failure rate can be effectively modeled and measured from logs.
Statistical hypothesis tests can detect regressions in failure rate.
The paper offers practical guidelines for analyzing and validating system changes.
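As a rough illustration of the first two findings, the sketch below computes an annualized failure rate from an event count and an exposure time, then compares two groups with the exact conditional (binomial) test for two Poisson rates. This is a generic textbook test and is not necessarily the one the paper uses. The function names, the machine-day exposure unit, and the Poisson assumption are all illustrative choices, not taken from the paper.

```python
from math import comb


def failure_rate(events: int, machine_days: float, per_days: float = 365.0) -> float:
    """Events per `per_days` of machine time (annualized by default).

    Hypothetical helper: the exposure unit (machine-days) is an assumption.
    """
    if machine_days <= 0:
        raise ValueError("machine_days must be positive")
    return events / machine_days * per_days


def rate_regression_p_value(k_base: int, t_base: float, k_new: int, t_new: float) -> float:
    """One-sided p-value for 'the new group has a higher failure rate'.

    Assumes event counts are Poisson with rate * exposure. Under the null
    hypothesis of equal rates, and conditional on the total count n,
    k_new ~ Binomial(n, t_new / (t_base + t_new)).
    """
    n = k_base + k_new
    p = t_new / (t_base + t_new)
    # Upper tail P(K >= k_new) of the conditional binomial distribution.
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k_new, n + 1))


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    print(failure_rate(events=12, machine_days=40_000))
    print(rate_regression_p_value(k_base=12, t_base=40_000, k_new=25, t_new=40_000))
```

A small p-value suggests the new group's rate is higher than the baseline's. Real reboot counts may be overdispersed relative to a Poisson model, in which case this test would overstate significance.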
Abstract
This article is divided into two chapters. The first chapter describes the failure rate as a KPI and studies its properties. The second goes over ways to compare this KPI across two groups using the concepts of statistical hypothesis testing. In section 1, we will motivate the failure rate as a KPI (in Azure, it is dubbed 'Annual Interruption Rate' or AIR). In section 3, we will discuss measuring the failure rate from the logs that machines typically generate. In section 1.2, we will discuss the problem of measuring it from real-world data. In section 2.1, we will discuss the general concepts of hypothesis testing. In section 2.2, we will go over some general count distributions for modeling Azure reboots. In section 2.3, we will go over some experiments on applying various hypothesis tests to simulated data. In section 2.4, we will discuss some applications of this work like using these…
Taxonomy
Topics: Software System Performance and Reliability · Advanced Database Systems and Queries · Big Data and Business Intelligence
