Beyond IVR: Benchmarking Customer Support LLM Agents for Business-Adherence
Sumanth Balaji, Piyush Mishra, Aashraya Sachdeva, Suraj Agrawal

TL;DR
This paper introduces JourneyBench, a new benchmark for evaluating large language model agents in customer support, focusing on policy adherence and realistic scenarios, and demonstrates that structured agents improve compliance.
Contribution
The paper presents JourneyBench, a novel benchmark with graph-based scenario generation and a new metric for policy adherence, enabling better evaluation of LLM agents in customer support tasks.
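As a rough illustration of what graph-based scenario generation and an adherence metric could look like, the sketch below models a support policy as a directed graph, enumerates each root-to-leaf path as one user journey, and scores an agent's tool calls against the expected journey. Everything here is an assumption made for illustration: the step names, `POLICY_GRAPH`, `enumerate_journeys`, and `coverage_score` are hypothetical, and the scoring rule is a simple stand-in, not the paper's definition of the User Journey Coverage Score.

```python
from typing import Dict, List

# Hypothetical policy graph: each node is a support step and each edge is an
# allowed transition. The step names are invented for this sketch.
POLICY_GRAPH: Dict[str, List[str]] = {
    "verify_identity": ["lookup_order"],
    "lookup_order": ["check_refund_eligibility"],
    "check_refund_eligibility": ["issue_refund", "escalate_to_human"],
    "issue_refund": [],
    "escalate_to_human": [],
}


def enumerate_journeys(graph: Dict[str, List[str]], start: str) -> List[List[str]]:
    """Return every path from `start` to a terminal node (assumes an acyclic graph)."""
    journeys: List[List[str]] = []
    stack: List[List[str]] = [[start]]
    while stack:
        path = stack.pop()
        successors = graph[path[-1]]
        if not successors:
            journeys.append(path)  # terminal step reached: one complete journey
        for nxt in successors:
            stack.append(path + [nxt])
    return journeys


def coverage_score(expected: List[str], observed: List[str]) -> float:
    """Illustrative stand-in metric, not the paper's formula.

    Counts how many expected steps, starting from the first, the observed
    calls reproduce in order. A skipped step stops the count, so jumping
    ahead in the policy is penalized.
    """
    matched = 0
    for call in observed:
        if matched < len(expected) and call == expected[matched]:
            matched += 1
    return matched / len(expected)


if __name__ == "__main__":
    journeys = enumerate_journeys(POLICY_GRAPH, "verify_identity")
    for journey in journeys:
        print(" -> ".join(journey))

    refund_journey = [
        "verify_identity",
        "lookup_order",
        "check_refund_eligibility",
        "issue_refund",
    ]
    # An agent that skips the order lookup only gets credit for the first step.
    print(coverage_score(refund_journey, ["verify_identity", "check_refund_eligibility", "issue_refund"]))
```

Under this toy rule, an agent that reaches the right final action while skipping a required intermediate step still scores low, which is the kind of gap that task-completion metrics alone would miss.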
Findings
Dynamic-Prompt Agents outperform Static-Prompt Agents in policy adherence.
With proper structuring, smaller models such as GPT-4o-mini can outperform larger models.
JourneyBench effectively measures policy compliance in diverse, realistic scenarios.
Abstract
Traditional customer support systems, such as Interactive Voice Response (IVR), rely on rigid scripts and lack the flexibility required for handling complex, policy-driven tasks. While large language model (LLM) agents offer a promising alternative, evaluating their ability to act in accordance with business rules and real-world support workflows remains an open challenge. Existing benchmarks primarily focus on tool usage or task completion, overlooking an agent's capacity to adhere to multi-step policies, navigate task dependencies, and remain robust to unpredictable user or environment behavior. In this work, we introduce JourneyBench, a benchmark designed to assess policy-aware agents in customer support. JourneyBench leverages graph representations to generate diverse, realistic support scenarios and proposes the User Journey Coverage Score, a novel metric to measure policy…
Taxonomy
Topics: AI in Service Interactions · Mobile Crowdsensing and Crowdsourcing · Speech and dialogue systems
