SAGE: A Service Agent Graph-guided Evaluation Benchmark

Ling Shi; Yuqin Dai; Ziyin Wang; Ning Gao; Wei Zhang; Chaozheng Wang; Yujie Wang; Wei He; Jinpeng Wang; Deiyi Xiong

arXiv:2604.09285 · cs.AI · April 13, 2026

TL;DR

SAGE is a multi-agent benchmark that uses Dynamic Dialogue Graphs to evaluate how well LLMs follow structured SOPs and handle diverse user behaviors in customer service scenarios.

Contribution

SAGE formalizes unstructured SOPs into Dynamic Dialogue Graphs and provides a modular framework for domain adaptation and automated dialogue synthesis.

Findings

01  Models excel at intent classification but struggle to take the correct subsequent actions.

02  Under high adversarial intensity, models show 'Empathy Resilience' despite logical failures.

03  A significant 'Execution Gap' is observed across 27 LLMs in industrial scenarios.

Abstract

The development of Large Language Models (LLMs) has catalyzed automation in customer service, yet benchmarking their performance remains challenging. Existing benchmarks predominantly rely on static paradigms and single-dimensional metrics, failing to account for diverse user behaviors or the strict adherence to structured Standard Operating Procedures (SOPs) required in real-world deployments. To bridge this gap, we propose SAGE (Service Agent Graph-guided Evaluation), a universal multi-agent benchmark for automated, dual-axis assessment. SAGE formalizes unstructured SOPs into Dynamic Dialogue Graphs, enabling precise verification of logical compliance and comprehensive path coverage. We introduce an Adversarial Intent Taxonomy and a modular Extension Mechanism, enabling low-cost deployment across domains and facilitating automated dialogue data synthesis. Evaluation is conducted via a…
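The idea of checking a dialogue against a graph-encoded SOP can be illustrated with a small sketch. This is not the SAGE implementation: the graph below is a hypothetical refund SOP, the names `SOP_GRAPH`, `is_compliant`, and `edge_coverage` are invented for illustration, and the graph is static, whereas the paper describes Dynamic Dialogue Graphs. It only shows, under those assumptions, how "logical compliance" and "path coverage" could be computed over a directed graph of dialogue states.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical SOP for a refund request (an assumption, not taken from the paper).
# Each key is a dialogue state; its value is the set of states allowed next.
SOP_GRAPH: Dict[str, Set[str]] = {
    "greet": {"identify_intent"},
    "identify_intent": {"verify_order", "escalate"},
    "verify_order": {"issue_refund", "escalate"},
    "issue_refund": {"close"},
    "escalate": {"close"},
    "close": set(),
}

Edge = Tuple[str, str]


def is_compliant(graph: Dict[str, Set[str]], trace: List[str], start: str = "greet") -> bool:
    """Return True if the trace starts at `start` and every step follows an allowed edge."""
    if not trace or trace[0] != start:
        return False
    return all(nxt in graph.get(cur, set()) for cur, nxt in zip(trace, trace[1:]))


def edge_coverage(graph: Dict[str, Set[str]], traces: List[List[str]]) -> float:
    """Fraction of the graph's edges exercised by at least one trace."""
    all_edges: Set[Edge] = {(a, b) for a, nexts in graph.items() for b in nexts}
    visited: Set[Edge] = set()
    for trace in traces:
        for edge in zip(trace, trace[1:]):
            if edge in all_edges:  # ignore transitions the SOP does not define
                visited.add(edge)
    return len(visited) / len(all_edges) if all_edges else 0.0


# The agent follows the SOP step by step.
good_trace = ["greet", "identify_intent", "verify_order", "issue_refund", "close"]
# The agent identifies the intent but skips order verification before refunding.
bad_trace = ["greet", "identify_intent", "issue_refund", "close"]

print(is_compliant(SOP_GRAPH, good_trace))
print(is_compliant(SOP_GRAPH, bad_trace))
print(edge_coverage(SOP_GRAPH, [good_trace]))
```

In this toy setup the second trace gets the intent right but then takes an action the graph does not allow from that state, which is the kind of failure a graph-based compliance check is meant to surface. Edge coverage is used here as one simple stand-in for path coverage; other definitions (for example, covering every start-to-end path) are equally plausible.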

Code & Models

Repositories

https://anonymous.4open.science/r/SAGE-Bench-4CD3