A Roadmap towards Intelligent Operations for Reliable Cloud Computing Systems
Yintong Huo, Cheryl Lee, Jinyang Liu, Tianyi Yang, and Michael R. Lyu

TL;DR
This paper proposes a data-driven AIOps approach to improve the reliability of cloud microservices by addressing internal and external challenges through ticket management, log analysis, multimodal analysis, and resilience testing.
Contribution
It introduces a comprehensive data-driven framework for enhancing cloud microservice reliability, integrating multiple analysis techniques and resilience testing.
Findings
Significant improvement in system reliability with the proposed approach
Effective handling of internal and external reliability challenges
Enhanced microservice resilience through multimodal analysis
Abstract
The increasing complexity and usage of cloud systems have made it challenging for service providers to ensure reliability. This paper highlights two main sources of challenge, internal and external factors, that affect the reliability of cloud microservices. It then discusses a data-driven approach that addresses these challenges from four key aspects: ticket management, log management, multimodal analysis, and microservice resilience testing. The experiments conducted show that the proposed data-driven AIOps solution significantly enhances system reliability from multiple angles.
Taxonomy
Topics: Software System Performance and Reliability · Cloud Computing and Resource Management · IoT and Edge/Fog Computing
