Intelligent Monitoring Framework for Cloud Services: A Data-Driven Approach
Pooja Srinivas, Fiza Husain, Anjaly Parayil, Ayush Choure, Chetan Bansal, Saravan Rajmohan

TL;DR
This paper presents a data-driven, deep learning framework for recommending cloud service monitors, improving coverage and reducing redundancy by leveraging a large dataset and structured ontology, validated through user studies at Microsoft.
Contribution
It introduces an ontology-based, deep learning approach for automated monitor recommendation in cloud services, addressing ad-hoc and incomplete monitoring practices.
Findings
Achieved high-quality monitor recommendations for most resource classes.
User study rated framework usefulness as 4.27 out of 5.
Derived a structured monitor ontology from over 30,000 monitors.
Abstract
Cloud service owners need to continuously monitor their services to ensure high availability and reliability. Gaps in monitoring can lead to delays in incident detection and significant negative customer impact. The current process of monitor creation is ad hoc and reactive. Developers create monitors using tribal knowledge and, primarily, a trial-and-error process. As a result, monitors often have incomplete coverage, which leads to production issues, or redundancy, which results in noise and wasted effort. In this work, we address this issue by proposing an intelligent monitoring framework that recommends monitors for cloud services based on their service properties. We start by mining the attributes of 30,000+ monitors from 791 production services at Microsoft and derive a structured ontology for monitors. We focus on two crucial dimensions: what to monitor…
Taxonomy
Topics: Cloud Computing and Resource Management · Software System Performance and Reliability · Big Data and Business Intelligence
Methods: travel james · Focus · Ontology
