EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

Tara Bogavelli; Gabrielle Gauthier Melançon; Katrina Stankiewicz; Oluwanifemi Bamgbose; Fanny Riols; Hoang H. Nguyen; Raghav Mehndiratta; Lindsay Devon Brin; Joseph Marinier; Hari Subramani; Anil Madamala; Sridhar Krishna Nemala; Srinivas Sunkara

arXiv:2605.13841·cs.SD·May 14, 2026

TL;DR

EVA-Bench is an open-source end-to-end framework for evaluating voice agents, addressing realistic conversation simulation and comprehensive quality measurement across multiple architectures and robustness scenarios.

Contribution

The paper introduces a benchmark with composite metrics, diverse scenarios, and robustness tests, enabling cross-architecture evaluation of voice agents.

Findings

01

No evaluated system exceeds 0.5 on EVA-A pass@1 and EVA-X pass@1 simultaneously.

02

The gap between peak and reliable performance is substantial and varies across systems.

03

Robustness to accent and noise perturbations varies widely among architectures.
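
To make the first two findings concrete, the sketch below computes pass@1 from repeated trials and contrasts a "peak" score (at least one of k trials succeeds) with a "reliable" score (all k trials succeed). The combinatorial estimators are the commonly used pass@k and pass^k forms; whether EVA-Bench defines its metrics exactly this way is an assumption, and the function names and trial counts are hypothetical illustration values, not results from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k trials, drawn from n trials with c successes, succeeds."""
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_power_k(n: int, c: int, k: int) -> float:
    """Chance that all k trials, drawn from n trials with c successes, succeed."""
    return comb(c, k) / comb(n, k)


def exceeds_both(eva_a: float, eva_x: float, threshold: float = 0.5) -> bool:
    """True only if both scores are strictly above the threshold."""
    return eva_a > threshold and eva_x > threshold


# Hypothetical scenario: 5 trials, 3 of which succeeded.
n_trials, n_success = 5, 3
pass1 = pass_at_k(n_trials, n_success, 1)        # single-attempt success rate
peak = pass_at_k(n_trials, n_success, 3)         # any of 3 attempts succeeds
reliable = pass_power_k(n_trials, n_success, 3)  # all 3 attempts succeed

print(f"pass@1={pass1:.2f} peak={peak:.2f} reliable={reliable:.2f}")
# Hypothetical scores: strong on one axis, weak on the other.
print(exceeds_both(eva_a=0.6, eva_x=0.4))
```

With these made-up counts the same system looks perfect under the peak measure and poor under the reliable one, which is the kind of divergence the second finding describes.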

Abstract

Voice agents, artificial intelligence systems that conduct spoken conversations to complete tasks, are increasingly deployed across enterprise applications. However, no existing benchmark jointly addresses two core evaluation challenges: generating realistic simulated conversations, and measuring quality across the full scope of voice-specific failure modes. We present EVA-Bench, an end-to-end evaluation framework that addresses both. On the simulation side, EVA-Bench orchestrates bot-to-bot audio conversations over dynamic multi-turn dialogues, with automatic simulation validation that detects user simulator error and appropriately regenerates conversations before scoring. On the measurement side, EVA-Bench introduces two composite metrics: EVA-A (Accuracy), capturing task completion, faithfulness, and audio-level speech fidelity; and EVA-X (Experience), capturing conversation…
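
The simulation-validation step described in the abstract (detect a user simulator error, regenerate the conversation, then score) can be pictured as a simple retry loop. This is a minimal sketch under stated assumptions: `simulate_until_valid`, `toy_simulate`, `toy_is_valid`, and the `max_attempts` limit are hypothetical names chosen for illustration and are not taken from the EVA-Bench codebase, and text turns stand in for the audio conversations.

```python
from typing import Callable, List, Optional

Conversation = List[str]


def simulate_until_valid(
    simulate: Callable[[int], Conversation],
    is_valid: Callable[[Conversation], bool],
    max_attempts: int = 3,
) -> Optional[Conversation]:
    """Regenerate a conversation until the validator accepts it or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        conversation = simulate(attempt)
        if is_valid(conversation):
            return conversation  # only validated conversations go on to scoring
    return None  # caller decides how to treat a scenario that never validated


def toy_simulate(attempt: int) -> Conversation:
    """Hypothetical simulator: the first attempt contains an empty user turn."""
    if attempt == 1:
        return ["agent: How can I help?", "user: "]
    return ["agent: How can I help?", "user: I need to reset my password."]


def toy_is_valid(conversation: Conversation) -> bool:
    """Hypothetical check: every turn must carry non-empty content after the speaker tag."""
    return all(turn.split(": ", 1)[-1].strip() for turn in conversation)


result = simulate_until_valid(toy_simulate, toy_is_valid)
print(result)
```

Keeping validation separate from scoring means a simulator failure is not counted against the agent under test, which appears to be the motivation for regenerating before scoring.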

Code & Models

Repositories

servicenow/eva
github

Datasets

ServiceNow-AI/eva
dataset· 254 dl
