Enhancing Failure Propagation Analysis in Cloud Computing Systems

Domenico Cotroneo; Luigi De Simone; Pietro Liguori; Roberto Natella; Nematollah Bidokhti

arXiv:1908.11640·cs.SE·March 9, 2022

TL;DR

This paper introduces a new method combining fault injection and anomaly detection to improve failure analysis accuracy in cloud systems, specifically demonstrated on OpenStack, with low computational overhead.

Contribution

It presents a novel approach that enhances failure propagation analysis by integrating fault injection with anomaly detection, addressing challenges of complexity and non-determinism.

Findings

1. Significantly reduces false positives and negatives in failure detection

2. Demonstrates effectiveness on the OpenStack cloud platform

3. Maintains low computational cost during analysis

Abstract

In order to plan for failure recovery, the designers of cloud systems need to understand how their system can potentially fail. Unfortunately, analyzing the failure behavior of such systems can be very difficult and time-consuming, due to the large volume of events, non-determinism, and reuse of third-party components. To address these issues, we propose a novel approach that joins fault injection with anomaly detection to identify the symptoms of failures. We evaluated the proposed approach in the context of the OpenStack cloud computing platform. We show that our model can significantly improve the accuracy of failure analysis in terms of false positives and negatives, with a low computational cost.
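To make the idea of pairing fault injection with anomaly detection concrete, here is a minimal sketch. It is not the authors' model, which this summary does not describe. It assumes each run is recorded as a list of event names, learns a per-event count baseline from fault-free runs, and flags events in a fault-injected run whose counts deviate from that baseline. The function names, the event names, and the threshold of 3 standard deviations are all hypothetical choices made for illustration.

```python
from collections import Counter
from statistics import mean, pstdev


def fit_baseline(fault_free_runs):
    """Learn, per event type, the mean and std of its count across fault-free runs."""
    counts = [Counter(run) for run in fault_free_runs]
    event_types = set().union(*counts)
    baseline = {}
    for ev in event_types:
        # Counter returns 0 for events that a given run never produced.
        values = [c[ev] for c in counts]
        baseline[ev] = (mean(values), pstdev(values))
    return baseline


def find_anomalies(baseline, faulty_run, threshold=3.0):
    """Return events whose count in the fault-injected run deviates from the baseline.

    Events that vary across fault-free runs (non-determinism) get a wider
    tolerance, which is what keeps benign variation from being reported.
    """
    observed = Counter(faulty_run)
    anomalies = {}
    for ev in set(baseline) | set(observed):
        mu, sigma = baseline.get(ev, (0.0, 0.0))
        deviation = abs(observed[ev] - mu)
        # With zero variance in the baseline, any nonzero deviation is flagged.
        if deviation > 0 and deviation > threshold * sigma:
            anomalies[ev] = (observed[ev], mu)
    return anomalies


# Hypothetical event traces: three fault-free runs and one fault-injected run.
fault_free = [
    ["create", "attach", "boot"],
    ["create", "retry", "attach", "boot"],
    ["create", "retry", "retry", "attach", "boot"],
]
faulty = ["create", "retry", "retry", "attach", "error", "error"]

baseline = fit_baseline(fault_free)
symptoms = find_anomalies(baseline, faulty)
```

In this toy data, the extra `error` events and the missing `boot` event are flagged as symptoms, while the `retry` count is not, because it already varies across the fault-free runs. A count-based baseline ignores event ordering, so it is only a starting point for the kind of analysis the abstract describes.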
