Cluster-Wide Task Slowdown Detection in Cloud System
Feiyi Chen, Yingying Zhang, Lunting Fan, Yuxuan Liang, Guansong Pang, Qingsong Wen, Shuiguang Deng

TL;DR
This paper introduces SORN, a novel method for detecting cluster-wide task slowdowns in cloud systems by analyzing task duration distributions, addressing limitations of existing single-task anomaly detection approaches.
Contribution
The paper proposes SORN, combining a Skimming Attention mechanism and Neural Optimal Transport, to effectively detect cluster-wide slowdowns amidst fluctuations, with a new adaptive loss function for training.
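As background on what optimal transport measures between two duration distributions, here is a minimal sketch of the classical 1-D Wasserstein-1 distance between two histograms. It is not SORN's Neural Optimal Transport component, which this summary does not detail. The function name `wasserstein_1d`, the equal-width bin assumption, and the sample histograms are all illustrative assumptions.

```python
from itertools import accumulate


def wasserstein_1d(p, q, bin_width=1.0):
    """Classical W1 distance between two histograms on the same equal-width bins.

    Assumes p and q are normalized (each sums to 1). In one dimension the
    optimal transport cost equals the area between the two cumulative
    distribution functions.
    """
    if len(p) != len(q):
        raise ValueError("histograms must share the same bins")
    cdf_p = accumulate(p)
    cdf_q = accumulate(q)
    return bin_width * sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


# Hypothetical duration histograms: mass shifts toward slower bins in `slow`.
normal = [0.5, 0.3, 0.2, 0.0]
slow = [0.1, 0.2, 0.3, 0.4]
score = wasserstein_1d(normal, slow)
```

A larger distance means more probability mass has to move further along the duration axis, which is one intuitive way to quantify how far a window has drifted toward slow tasks.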
Findings
SORN outperforms existing methods on real-world datasets.
The Skimming Attention effectively reconstructs compound periodicity.
The adaptive loss improves training robustness.
Abstract
Slow task detection is a critical problem in cloud operation and maintenance because it is closely tied to user experience and can incur substantial liquidated damages. Most anomaly detection methods approach it from a single-task perspective. However, with millions of concurrent tasks in large-scale cloud computing clusters, per-task detection becomes impractical and inefficient. Moreover, single-task slowdowns are very common and do not necessarily indicate a cluster malfunction, because task durations fluctuate violently in a virtual environment. Thus, we shift our attention to cluster-wide task slowdowns by utilizing the duration time distribution of tasks across a cluster, so that the computational complexity does not depend on the number of tasks. The task duration time distribution often exhibits compound periodicity and local exceptional fluctuations over time. Though transformer-based methods…
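To illustrate why working on the duration distribution decouples model input size from task count, here is a minimal sketch that summarizes any number of task durations in a time window as a fixed-length normalized histogram. The bin edges, the clamping of out-of-range values, and the function name `duration_histogram` are assumptions for illustration, not details taken from the paper.

```python
from bisect import bisect_right


def duration_histogram(durations, bin_edges):
    """Return a normalized histogram of task durations over fixed bins.

    bin_edges must be sorted ascending. Durations outside the covered range
    are clamped into the first or last bin (an assumption of this sketch).
    """
    n_bins = len(bin_edges) - 1
    counts = [0] * n_bins
    for d in durations:
        idx = bisect_right(bin_edges, d) - 1
        idx = min(max(idx, 0), n_bins - 1)
        counts[idx] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * n_bins
    return [c / total for c in counts]


# Hypothetical bin edges in seconds; the output length is fixed by the bins.
BIN_EDGES = [0, 1, 5, 30, 120, 600]

window_a = [0.4, 0.8, 2.0, 3.5, 12.0, 45.0]
window_b = [0.5, 40.0, 90.0, 150.0, 300.0, 700.0]
hist_a = duration_histogram(window_a, BIN_EDGES)
hist_b = duration_histogram(window_b, BIN_EDGES)
```

Whether a window holds six tasks or six million, the detector downstream sees a vector with one entry per bin, so its cost depends on the number of bins and time steps rather than on the number of tasks.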
Taxonomy
Topics: Cloud Computing and Resource Management · Distributed and Parallel Computing Systems · Software System Performance and Reliability
Methods: Softmax · Attention Is All You Need · Sparse Evolutionary Training
