TL;DR
SAGE is a multi-agent benchmark that uses Dynamic Dialogue Graphs to evaluate how well LLMs follow structured SOPs and handle diverse user behaviors in customer service scenarios.
Contribution
It formalizes unstructured SOPs into Dynamic Dialogue Graphs and provides a modular framework for domain adaptation and automated dialogue synthesis.
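To make the graph idea concrete, here is a minimal sketch of how an SOP might be encoded as a dialogue graph and used to check compliance and path coverage. It is an illustration under stated assumptions, not SAGE's actual implementation: the `DialogueGraph` class, the `check_compliance` and `path_coverage` helpers, and the toy refund SOP are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class DialogueGraph:
    """Toy SOP graph (hypothetical): nodes are agent actions, edges are keyed by user intent."""
    start: str
    # edges[node][intent] -> the action the SOP requires next
    edges: dict[str, dict[str, str]] = field(default_factory=dict)

    def add_edge(self, src: str, intent: str, dst: str) -> None:
        self.edges.setdefault(src, {})[intent] = dst

    def all_edges(self) -> set[tuple[str, str, str]]:
        return {(s, i, d) for s, m in self.edges.items() for i, d in m.items()}


def check_compliance(graph: DialogueGraph, turns: list[tuple[str, str]]):
    """Walk the graph over (user_intent, agent_action) turns.

    Returns (compliant, visited_edges); stops at the first off-graph action.
    """
    node = graph.start
    visited: set[tuple[str, str, str]] = set()
    for intent, action in turns:
        expected = graph.edges.get(node, {}).get(intent)
        if expected is None or action != expected:
            return False, visited
        visited.add((node, intent, expected))
        node = expected
    return True, visited


def path_coverage(graph: DialogueGraph, dialogues: list[list[tuple[str, str]]]) -> float:
    """Fraction of SOP edges exercised by the compliant prefixes of the dialogues."""
    covered: set[tuple[str, str, str]] = set()
    for turns in dialogues:
        _, visited = check_compliance(graph, turns)
        covered |= visited
    total = graph.all_edges()
    return len(covered) / len(total) if total else 0.0


# Assumed toy SOP for a refund flow (not taken from the paper).
sop = DialogueGraph(start="greet")
sop.add_edge("greet", "request_refund", "verify_order")
sop.add_edge("verify_order", "provide_order_id", "issue_refund")
sop.add_edge("verify_order", "refuse_order_id", "escalate")

good = [("request_refund", "verify_order"), ("provide_order_id", "issue_refund")]
bad = [("request_refund", "issue_refund")]  # skips the verification step

print(check_compliance(sop, good)[0])
print(check_compliance(sop, bad)[0])
print(path_coverage(sop, [good, bad]))
```

In this sketch the second dialogue recognizes the right intent but takes an action the SOP does not allow at that node, which is the kind of failure a graph-based check can flag mechanically.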
Findings
Models excel at intent classification but struggle to choose the correct subsequent actions.
Under high adversarial intensity, models exhibit 'Empathy Resilience' despite logical failures.
A significant 'Execution Gap' is observed across 27 LLMs in industrial scenarios.
Abstract
The development of Large Language Models (LLMs) has catalyzed automation in customer service, yet benchmarking their performance remains challenging. Existing benchmarks predominantly rely on static paradigms and single-dimensional metrics, failing to account for diverse user behaviors or the strict adherence to structured Standard Operating Procedures (SOPs) required in real-world deployments. To bridge this gap, we propose SAGE (Service Agent Graph-guided Evaluation), a universal multi-agent benchmark for automated, dual-axis assessment. SAGE formalizes unstructured SOPs into Dynamic Dialogue Graphs, enabling precise verification of logical compliance and comprehensive path coverage. We introduce an Adversarial Intent Taxonomy and a modular Extension Mechanism, enabling low-cost deployment across domains and facilitating automated dialogue data synthesis. Evaluation is conducted via a…
