Challenges and Solutions to Build a Data Pipeline to Identify Anomalies in Enterprise System Performance

Xiaobo Huang; Amitabha Banerjee; Chien-Chia Chen; Chengzhi Huang; Tzu Yi Chuang; Abhishek Srivastava; Razvan Cheveresan

arXiv:2112.08940·cs.LG·December 17, 2021

TL;DR

This paper discusses how VMware addresses data challenges like label scarcity and data drift to improve the accuracy and stability of ML-based anomaly detection in enterprise data centers.

Contribution

The paper presents solutions to data challenges in deploying anomaly detection systems, resulting in a 30% accuracy improvement and sustained model performance over time.

Findings

01. 30% increase in anomaly detection accuracy
02. Model performance remains stable over time
03. Successful deployment in production environment

Abstract

We discuss how VMware is solving the following challenges in harnessing data to operate our ML-based anomaly detection system, which detects performance issues in our Software Defined Data Center (SDDC) enterprise deployments: (i) label scarcity and label bias due to heavy dependency on unscalable human annotators, and (ii) data drift due to ever-changing workload patterns, software stacks, and underlying hardware. Our anomaly detection system has been deployed in production for many years and has successfully detected numerous major performance issues. We demonstrate that by addressing these data challenges, we not only improve the accuracy of our performance anomaly detection model by 30%, but also ensure that model performance never degrades over time.

Taxonomy

Topics: Software System Performance and Reliability · Anomaly Detection Techniques and Applications · Network Security and Intrusion Detection