Alioth: A Machine Learning Based Interference-Aware Performance Monitor for Multi-Tenancy Applications in Public Cloud
Tianyao Shi, Yingxuan Yang, Yunlong Cheng, Xiaofeng Gao, Zhen Fang, Yongqiang Yang

TL;DR
Alioth is a machine learning framework that detects performance degradation in multi-tenant cloud applications by analyzing low-level metrics, using transfer learning and interpretability techniques to improve accuracy and usability.
Contribution
The paper introduces Alioth, a novel ML-based system that monitors interference-induced performance issues in cloud environments using transfer learning and feature interpretability.
Findings
Alioth achieves a mean absolute error (MAE) of 5.29% offline and 10.8% on applications unseen during training.
It outperforms baseline methods in detecting performance degradation.
Alioth demonstrates robustness under dynamic cloud conditions.
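As a rough illustration of how an MAE figure like the ones above is computed, the sketch below averages the absolute gap between measured and predicted performance degradation. The function name and the sample values are hypothetical and are not taken from the paper; degradation is assumed to be expressed in percent.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between paired measurements."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical degradation values in percent (illustrative only).
actual = [0.0, 10.0, 25.0, 40.0]     # measured degradation per sample
predicted = [2.0, 7.0, 30.0, 38.0]   # model estimate per sample

mae = mean_absolute_error(actual, predicted)
print(f"MAE: {mae:.2f} percentage points")
```

Under this reading, an MAE of 5.29% would mean the predicted degradation is off by about 5.29 percentage points on average.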
Abstract
Multi-tenancy in public clouds may lead to co-location interference on shared resources, which possibly results in performance degradation of cloud applications. Cloud providers want to know when such events happen and how serious the degradation is, to perform interference-aware migrations and alleviate the problem. However, virtual machines (VM) in Infrastructure-as-a-Service public clouds are black-boxes to providers, where application-level performance information cannot be acquired. This makes performance monitoring intensely challenging as cloud providers can only rely on low-level metrics such as CPU usage and hardware counters. We propose a novel machine learning framework, Alioth, to monitor the performance degradation of cloud applications. To feed the data-hungry models, we first elaborate interference generators and conduct comprehensive co-location experiments on a…
Taxonomy
Topics: Software System Performance and Reliability · IoT and Edge/Fog Computing · Cloud Computing and Resource Management
Methods: Feature Selection · Shapley Additive Explanations
