VoiceAgentEval: A Dual-Dimensional Benchmark for Expert-Level Intelligent Voice-Agent Evaluation of Xbench's Professional-Aligned Series

Pengyu Xu; Shijia Li; Ao Sun; Feng Zhang; Yahan Li; Bo Wu; Zhanyu Ma; Jiguo Li; Jun Xu; Jiuchong Gao; Jinghua Hao; Renqing He; Rui Wang; Yang Liu; Xiaobo Hu; Fan Yang; Jia Zheng; Guanghua Yao

arXiv:2510.21244·cs.AI·November 17, 2025

Open Access

TL;DR

OutboundEval is a comprehensive benchmark designed to evaluate large language models in expert-level outbound calling scenarios across diverse domains, addressing dataset diversity, realistic user simulation, and evaluation accuracy.

Contribution

The paper introduces OutboundEval, a structured, domain-oriented benchmark with a large-model-driven user simulator and dynamic evaluation methods for expert-level outbound AI systems.

Findings

01. Twelve state-of-the-art LLMs were evaluated, revealing trade-offs between task completion and fluency.

02. The benchmark covers six business domains and 30 sub-scenarios with domain-adaptive metrics.

03. Experiments demonstrate the effectiveness of the proposed evaluation framework.

Abstract

We propose OutboundEval, a comprehensive benchmark for evaluating large language models (LLMs) in expert-level intelligent outbound calling scenarios. Unlike existing methods that suffer from three key limitations - insufficient dataset diversity and category coverage, unrealistic user simulation, and inaccurate evaluation metrics - OutboundEval addresses these issues through a structured framework. First, we design a benchmark spanning six major business domains and 30 representative sub-scenarios, each with scenario-specific process decomposition, weighted scoring, and domain-adaptive metrics. Second, we develop a large-model-driven User Simulator that generates diverse, persona-rich virtual users with realistic behaviors, emotional variability, and communication styles, providing a controlled yet authentic testing environment. Third, we introduce a dynamic evaluation method that…
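
To make the idea of scenario-specific process decomposition with weighted scoring concrete, here is a minimal sketch. It is not the paper's scoring procedure: the step names, the weights, the 0-to-1 score range, and the names `Step` and `weighted_score` are all hypothetical, chosen only to illustrate how per-step scores could be combined into one scenario score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One step of a decomposed call process (hypothetical structure)."""
    name: str
    weight: float  # relative importance of this step within the scenario


def weighted_score(steps, step_scores):
    """Combine per-step scores in [0, 1] into a weight-normalized total.

    Steps with no recorded score count as 0.0 (an assumption of this sketch).
    """
    total_weight = sum(step.weight for step in steps)
    if total_weight <= 0:
        raise ValueError("step weights must sum to a positive value")
    weighted_sum = sum(
        step.weight * step_scores.get(step.name, 0.0) for step in steps
    )
    return weighted_sum / total_weight


# Hypothetical outbound-call scenario decomposed into four weighted steps.
scenario = [
    Step("greeting", 1.0),
    Step("identity_check", 2.0),
    Step("explain_offer", 3.0),
    Step("confirm_outcome", 4.0),
]

# Hypothetical per-step scores; "confirm_outcome" was never reached.
scores = {"greeting": 1.0, "identity_check": 1.0, "explain_offer": 0.5}

print(round(weighted_score(scenario, scores), 2))  # (1 + 2 + 1.5 + 0) / 10 = 0.45
```

Normalizing by the total weight keeps scores comparable across scenarios that have different numbers of steps, which is one plausible reason to pair weighting with a per-scenario decomposition.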

Taxonomy

Topics: AI in Service Interactions · Topic Modeling · Speech and dialogue systems