Beyond IVR: Benchmarking Customer Support LLM Agents for Business-Adherence

Sumanth Balaji; Piyush Mishra; Aashraya Sachdeva; Suraj Agrawal

arXiv:2601.00596 · cs.CL · January 5, 2026

TL;DR

This paper introduces JourneyBench, a new benchmark for evaluating large language model agents in customer support, focusing on policy adherence and realistic scenarios, and demonstrates that structured agents improve compliance.

Contribution

The paper presents JourneyBench, a novel benchmark with graph-based scenario generation and a new metric for policy adherence, enabling better evaluation of LLM agents in customer support tasks.
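To make the idea of graph-based scenario generation concrete, here is a minimal sketch. It assumes a support policy can be modeled as a small acyclic directed graph of steps, with each root-to-terminal path treated as one scenario. The graph contents, the function names, and the `illustrative_coverage` scoring rule are all hypothetical stand-ins chosen for illustration. They are not the paper's generator, and the scoring rule is not the User Journey Coverage Score, whose definition is not given on this page.

```python
from typing import Dict, List

# Hypothetical policy graph (an assumption for illustration):
# each key is a support step, each value lists the allowed next steps.
POLICY_GRAPH: Dict[str, List[str]] = {
    "greet": ["verify_identity"],
    "verify_identity": ["check_order", "escalate"],
    "check_order": ["issue_refund", "escalate"],
    "issue_refund": [],
    "escalate": [],
}


def enumerate_journeys(graph: Dict[str, List[str]], start: str) -> List[List[str]]:
    """Return every path from `start` to a terminal node (graph must be acyclic)."""
    successors = graph[start]
    if not successors:
        return [[start]]
    journeys = []
    for nxt in successors:
        for tail in enumerate_journeys(graph, nxt):
            journeys.append([start] + tail)
    return journeys


def illustrative_coverage(expected: List[str], observed: List[str]) -> float:
    """Illustrative score only: the fraction of the expected journey that the
    observed steps match in order. Extra observed steps are ignored, and a
    skipped expected step blocks credit for every step after it."""
    position = 0
    for step in observed:
        if position < len(expected) and step == expected[position]:
            position += 1
    return position / len(expected)


if __name__ == "__main__":
    for journey in enumerate_journeys(POLICY_GRAPH, "greet"):
        print(" -> ".join(journey))
```

Under this toy rule, an agent that jumps from `greet` straight to `check_order` gets credit only for the first step, which reflects the intuition that skipping a required policy step (here, identity verification) should not be rewarded by later correct actions.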

Findings

01. Dynamic-Prompt Agents outperform Static-Prompt Agents in policy adherence.

02. With proper structuring, smaller models such as GPT-4o-mini can outperform larger models.

03. JourneyBench effectively measures policy compliance in diverse, realistic scenarios.

Abstract

Traditional customer support systems, such as Interactive Voice Response (IVR), rely on rigid scripts and lack the flexibility required for handling complex, policy-driven tasks. While large language model (LLM) agents offer a promising alternative, evaluating their ability to act in accordance with business rules and real-world support workflows remains an open challenge. Existing benchmarks primarily focus on tool usage or task completion, overlooking an agent's capacity to adhere to multi-step policies, navigate task dependencies, and remain robust to unpredictable user or environment behavior. In this work, we introduce JourneyBench, a benchmark designed to assess policy-aware agents in customer support. JourneyBench leverages graph representations to generate diverse, realistic support scenarios and proposes the User Journey Coverage Score, a novel metric to measure policy…

Taxonomy

Topics: AI in Service Interactions · Mobile Crowdsensing and Crowdsourcing · Speech and dialogue systems