CirrusBench: Evaluating LLM-based Agents Beyond Correctness in Real-World Cloud Service Environments

Yi Yu; Guangquan Hu; Chenghuang Shen; Xingyan Liu; Jing Gu; Hangyi Sun; Junzhuo Ma; Weiting Liu; Jianfeng Liu; Mingyue Pu; Yu Wang; Zhengdong Xiao; Rui Xie; Longjiu Luo; Qianrong Wang; Gurong Cui; Honglin Qiao; Wenlian Lu

arXiv:2603.28569 · cs.LG · March 31, 2026

TL;DR

CirrusBench is a new evaluation framework for LLM-based customer service agents, using real-world cloud support data to assess both reasoning and efficiency in complex, multi-turn interactions.

Contribution

It introduces a benchmark based on authentic cloud service tickets with novel customer-centric metrics for evaluating resolution efficiency.

Findings

01. State-of-the-art models excel in reasoning but struggle with complex multi-turn tasks.

02. Models often fail to meet efficiency standards necessary for real-world customer service.

03. CirrusBench highlights key areas for improving LLM-based agent performance in practical settings.

Abstract

The increasing agentic capabilities of Large Language Models (LLMs) have enabled their deployment in real-world applications, such as cloud services, where customer-assistant interactions exhibit high technical complexity and long-horizon dependencies, making robustness and resolution efficiency critical for customer satisfaction. However, existing benchmarks for LLM-based agents largely rely on synthetic environments that fail to capture the diversity and unpredictability of authentic customer inputs, often ignoring the resolution efficiency essential for real-world deployment. To bridge this gap, we introduce CirrusBench, a novel evaluation framework distinguished by its foundation in real-world data from authentic cloud service tickets. CirrusBench preserves the intricate multi-turn logical chains and realistic tool dependencies inherent to technical service environments. Moving…

Code & Models

Repositories

CirrusAI (GitHub)
