AI for IT Operations (AIOps) on Cloud Platforms: Reviews, Opportunities and Challenges
Qian Cheng, Doyen Sahoo, Amrita Saha, Wenzhuo Yang, Chenghao Liu, Gerald Woo, Manpreet Singh, Silvio Savarese, Steven C. H. Hoi

TL;DR
This paper reviews the application of AI techniques to cloud-based IT operations, highlighting current trends, challenges, and opportunities, and categorizing key tasks such as incident detection and failure prediction.
Contribution
It provides a comprehensive taxonomy of AI methods for AIOps tasks, discusses data challenges, and identifies underexplored areas with potential for future AI advancements.
Findings
Key AIOps tasks are incident detection, failure prediction, root cause analysis, and automated actions.
Analysis of IT operational data reveals scale and complexity challenges.
The review identifies underexplored topics that could benefit from AI research.
Abstract
Artificial Intelligence for IT operations (AIOps) aims to combine the power of AI with the big data generated by IT Operations processes, particularly in cloud infrastructures, to provide actionable insights with the primary goal of maximizing availability. There is a wide variety of problems to address, and multiple use-cases, where AI capabilities can be leveraged to enhance operational efficiency. Here we provide a review of the AIOps vision, trends, challenges and opportunities, specifically focusing on the underlying AI techniques. We discuss in depth the key types of data emitted by IT Operations activities, the scale and challenges in analyzing them, and where they can be helpful. We categorize the key AIOps tasks as incident detection, failure prediction, root cause analysis and automated actions. We discuss the problem formulation for each task, and then present a taxonomy of…
Taxonomy
Topics: Software System Performance and Reliability · Big Data and Business Intelligence · Data Quality and Management
