AID: Efficient Prediction of Aggregated Intensity of Dependency in Large-scale Cloud Systems
Tianyi Yang, Jiacheng Shen, Yuxin Su, Xiao Ling, Yongqiang Yang, Michael R. Lyu

TL;DR
This paper introduces AID, a novel method for efficiently predicting the degree of dependency between cloud services by analyzing their status similarities, which helps improve failure diagnosis and system reliability.
Contribution
AID is the first approach to predict the intensity of dependencies between cloud services using multivariate time series and similarity aggregation, enhancing failure impact analysis.
Findings
AID accurately predicts dependency intensities in cloud systems.
AID demonstrates efficiency in large-scale cloud environments.
Experimental results confirm the effectiveness of AID in real-world scenarios.
Abstract
Service reliability is one of the key challenges that cloud providers have to deal with. In cloud systems, unplanned service failures may cause severe cascading impacts on their dependent services, deteriorating customer satisfaction. Predicting the cascading impacts accurately and efficiently is critical to the operation and maintenance of cloud systems. Existing approaches identify whether one service depends on another via distributed tracing, but no prior work has focused on distinguishing how strong the dependency between cloud services is. In this paper, we survey the outages and the procedure for failure diagnosis in two cloud providers to motivate the definition of the intensity of dependency. We define the intensity of dependency between two services as how much the status of the callee service influences the caller service. Then we propose AID, the first approach to predict the…
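To make the idea of aggregating status similarities concrete, here is a minimal sketch in Python. It is not the paper's algorithm: the similarity measure (Pearson correlation, with negative values clamped to zero), the aggregation (a plain mean over shared indicators), and the names `pearson` and `aggregated_similarity` are all assumptions chosen for illustration. Each service's status is represented as a hypothetical mapping from indicator name to a time series.

```python
from math import sqrt
from typing import Dict, List


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no co-variation signal
    return cov / sqrt(var_x * var_y)


def aggregated_similarity(caller: Dict[str, List[float]],
                          callee: Dict[str, List[float]]) -> float:
    """Illustrative intensity proxy in [0, 1].

    Assumption: per-indicator similarity is the Pearson correlation clamped
    at zero, and the aggregate is the mean over indicators that both
    services report. The paper's actual measure may differ.
    """
    shared = sorted(set(caller) & set(callee))
    if not shared:
        raise ValueError("no shared indicators")
    sims = [max(0.0, pearson(caller[name], callee[name])) for name in shared]
    return sum(sims) / len(sims)


if __name__ == "__main__":
    # Hypothetical status series for a caller/callee pair.
    caller_status = {"latency": [1.0, 2.0, 3.0, 4.0], "error_rate": [0.0, 0.0, 1.0, 1.0]}
    callee_status = {"latency": [2.0, 4.0, 6.0, 8.0], "error_rate": [0.0, 0.0, 1.0, 1.0]}
    print(aggregated_similarity(caller_status, callee_status))
```

Under these assumptions, a score near 1 means the caller's status tracks the callee's closely, and a score near 0 means the two move independently or in opposite directions.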
Taxonomy
Topics: Software System Performance and Reliability · Cloud Computing and Resource Management · Anomaly Detection Techniques and Applications
