Towards Generalizable Context-aware Anomaly Detection: A Large-scale Benchmark in Cloud Environments

Xinkai Zou; Xuan Jiang; Ruikai Huang; Haoze He; Parv Kapoor; Hongrui Wu; Yibo Wang; Jian Sha; Xiongbo Shi; Zixun Huang; Jinhua Zhao

arXiv:2508.01844 · cs.AI · October 7, 2025

TL;DR

This paper introduces CloudAnoBench, a large-scale, challenging benchmark for context-aware anomaly detection in cloud environments, and proposes CloudAnoAgent, an LLM-based detection system that effectively integrates metrics and logs.

Contribution

The paper presents a novel benchmark combining metrics and logs for context anomalies and an LLM-based agent with symbolic verification for improved detection.

Findings

01 CloudAnoBench contains 28 anomalous scenarios and 16 deceptive normal scenarios, with 1,252 labeled cases and roughly 200,000 log and metric entries.

02 Existing methods perform poorly on CloudAnoBench due to its difficulty and ambiguity.

03 CloudAnoAgent significantly improves detection accuracy and generalizes well to other datasets.
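
The reviews further down this page describe the improvement in terms of F1 and false positive rate (FPR). As a reminder of what those two numbers measure, here is a minimal sketch using their standard definitions. The function name and the confusion counts are hypothetical and chosen only for illustration; they are not taken from the paper or its results.

```python
from typing import Tuple


def f1_and_fpr(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float]:
    """Return (F1, FPR) from confusion-matrix counts.

    tp/fn count anomalous cases that were caught/missed;
    fp/tn count normal cases that were flagged/passed.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    # FPR only looks at normal cases, which is why "deceptive normal"
    # scenarios stress it: each one wrongly flagged adds to fp.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return f1, fpr


# Hypothetical counts, not results from the paper.
f1, fpr = f1_and_fpr(tp=8, fp=2, tn=18, fn=2)
print(f"F1={f1:.2f}, FPR={fpr:.2f}")
```

A detector can raise F1 while still producing many false alarms on normal traffic, so the two numbers are usually read together.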

Abstract

Anomaly detection in cloud environments remains both critical and challenging. Existing context-level benchmarks typically focus on either metrics or logs and often lack reliable annotation, while most detection methods emphasize point anomalies within a single modality, overlooking contextual signals and limiting real-world applicability. Constructing a benchmark for context anomalies that combines metrics and logs is inherently difficult: reproducing anomalous scenarios on real servers is often infeasible or potentially harmful, while generating synthetic data introduces the additional challenge of maintaining cross-modal consistency. We introduce CloudAnoBench, a large-scale benchmark for context anomalies in cloud environments, comprising 28 anomalous scenarios and 16 deceptive normal scenarios, with 1,252 labeled cases and roughly 200,000 log and metric entries. Compared with prior…

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The benchmark itself is a solid contribution. The focus on "context anomalies" that require both metrics and logs is well-motivated, and the inclusion of "deceptive normal scenarios" addresses a common and practical challenge in real-world operations where false alarms are costly.
2. The design of the CLOUDANOAGENT, which combines LLM-based reasoning with a symbolic verifier, is a strong point. This hybrid approach directly tackles the reliability issues often associated with LLMs, using a ru

Weaknesses

1. The benchmark is generated synthetically using LLMs. While this is a practical necessity for scenarios that are dangerous to reproduce, it raises questions about the data's realism and potential biases. An LLM-based evaluation model (CLOUDANOAGENT) might perform very well on data generated by another LLM, but it's unclear if this performance would hold on real-world, non-synthetic data.
2. The claim of "strong generalization" seems a bit of a stretch. The agent is tested on older, log-only da

Reviewer 02 · Rating 6 · Confidence 3

Strengths

+) A comprehensive and realistic benchmark that combines both metrics and logs
+) Extensive experimental results showing significant improvements over LLM baselines

Weaknesses

-) The benchmark, while large, is still limited to the scenarios and data sources curated by the authors, which may not fully cover network anomalies
-) It might be better to analyze failure cases

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. Positioning anomalies as contextual interactions between metrics and logs is well-motivated and distinct from point-anomaly setups. The benchmark includes deliberately deceptive normal scenarios to stress false-positive robustness.
2. The scenario taxonomy spans resource, network, software/app, malicious, and subtle cases; normal scenarios are enumerated with plausible operational events.
3. CloudAnoAgent improves F1 and reduces FPR over ML-based and vanilla-LLM methods. It further shows compe

Weaknesses

1. The dataset generation relies heavily on LLMs. Although human review is used for verification, the paper lacks quantitative evidence of how realistic the generated metric dynamics and log semantics are.
2. More recent DL-based and LLM-based anomaly detection baselines that are metric-only are missing. The paper only evaluates ML-based metric-only methods.
3. The paper provides ablation with/without the symbolic verifier but does not isolate the contributions of Fast vs Slow stages or the

Taxonomy

Topics: Network Security and Intrusion Detection · Anomaly Detection Techniques and Applications · Advanced Malware Detection Techniques