Enhancing the Analysis of Software Failures in Cloud Computing Systems with Deep Learning

Domenico Cotroneo; Luigi De Simone; Pietro Liguori; Roberto Natella

arXiv:2106.15182 · cs.AI · March 9, 2022


TL;DR

This paper introduces a deep learning-based clustering method to analyze cloud system failures, reducing manual effort and improving the accuracy of failure mode identification in complex, noisy data.

Contribution

It presents a novel application of Deep Embedded Clustering to cloud failure data, eliminating the need for manual feature engineering and enhancing failure analysis.

Findings

1. Deep Embedded Clustering achieves comparable or better cluster purity than manual methods.
2. The approach reduces the need for domain expertise in failure analysis.
3. The resulting failure-mode distribution aligns more closely with the actual frequencies.
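The findings compare approaches by cluster purity. A minimal sketch of the standard purity metric follows: each cluster is credited with the count of its most common ground-truth label, and the total is divided by the number of samples. The function name and the example failure labels are hypothetical and do not come from the paper or its dataset.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, true_labels):
    """Purity = (1/N) * sum over clusters of the count of the majority true label."""
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have the same length")
    if not cluster_ids:
        raise ValueError("need at least one sample")
    # Group the ground-truth labels by the cluster each sample was assigned to.
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, true_labels):
        members[cluster].append(label)
    # For each cluster, count how many samples carry its most frequent label.
    majority_total = sum(
        Counter(labels).most_common(1)[0][1] for labels in members.values()
    )
    return majority_total / len(cluster_ids)


# Hypothetical example: three clusters over six failures with made-up labels.
clusters = [0, 0, 0, 1, 1, 2]
labels = ["crash", "crash", "timeout", "timeout", "timeout", "crash"]
print(cluster_purity(clusters, labels))
```

Here cluster 0 contributes 2, cluster 1 contributes 2 and cluster 2 contributes 1, so purity is 5/6. A purity of 1.0 means every cluster contains a single failure mode. Purity alone favors many small clusters, so it is usually read alongside the number of clusters.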

Abstract

Identifying the failure modes of cloud computing systems is a difficult and time-consuming task, due to the growing complexity of such systems, and the large volume and noisiness of failure data. This paper presents a novel approach for analyzing failure data from cloud systems, in order to relieve human analysts from manually fine-tuning the data for feature engineering. The approach leverages Deep Embedded Clustering (DEC), a family of unsupervised clustering algorithms based on deep learning, which uses an autoencoder to optimize data dimensionality and inter-cluster variance. We applied the approach in the context of the OpenStack cloud computing platform, both on the raw failure data and in combination with an anomaly detection pre-processing algorithm. The results show that the performance of the proposed approach, in terms of purity of clusters, is comparable to, or in some cases…
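To make the clustering step concrete, here is a minimal sketch in plain Python of the soft-assignment and target-distribution steps from the original DEC formulation (Xie et al., 2016). It is not the authors' implementation, and the DEC variants used in the paper may differ in detail. It assumes the failure data has already been mapped to low-dimensional embeddings `z` by the encoder of a trained autoencoder, and that initial centroids are given. Training the autoencoder and back-propagating the loss are omitted, and all function names are hypothetical.

```python
import math


def soft_assign(z, centroids, alpha=1.0):
    """Student's t soft assignment q_ij of embedding i to centroid j (rows sum to 1)."""
    q = []
    for zi in z:
        row = []
        for mu in centroids:
            d2 = sum((a - b) ** 2 for a, b in zip(zi, mu))
            row.append((1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(row)
        q.append([v / total for v in row])
    return q


def target_distribution(q):
    """Sharpened target p_ij proportional to q_ij^2 / f_j, where f_j = sum_i q_ij."""
    k = len(q[0])
    f = [sum(row[j] for row in q) for j in range(k)]  # soft cluster frequencies
    p = []
    for row in q:
        w = [row[j] ** 2 / f[j] for j in range(k)]
        total = sum(w)
        p.append([v / total for v in w])
    return p


def kl_divergence(p, q):
    """KL(P || Q) summed over samples and clusters: the DEC clustering loss."""
    return sum(
        pij * math.log(pij / qij)
        for prow, qrow in zip(p, q)
        for pij, qij in zip(prow, qrow)
        if pij > 0
    )


# Hypothetical 2-D embeddings of three failures and two initial centroids.
z = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
centroids = [[0.0, 0.0], [5.0, 5.0]]
q = soft_assign(z, centroids)
p = target_distribution(q)
print(kl_divergence(p, q))
```

Squaring `q` and dividing by the soft cluster frequency pushes confident assignments closer to 1 while discouraging large clusters from absorbing everything. In full DEC, minimizing this KL loss updates both the encoder weights and the centroids, so the embedding and the clustering are refined together.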

Code & Models

Repositories

dessertlab/Failure-Dataset-OpenStack (official)