Testing the Testers: Human-Driven Quality Assessment of Voice AI Testing Platforms

Miguel E. Andres; Vadim Fedorov; Rida Sadek; Enric Spagnolo-Arrizabalaga; Nadescha Trudel

arXiv:2511.04133·cs.AI·January 15, 2026

TL;DR

This paper introduces a human-centered benchmarking framework for evaluating voice AI testing platforms, addressing the critical need for objective measurement of testing quality as voice AI scales globally.

Contribution

It presents the first systematic, statistically validated framework for assessing voice AI testing platforms' simulation and evaluation quality using human judgments.

Findings

01. Evalion outperforms competitors with 0.92 evaluation quality

02. Top platform achieves 0.61 simulation quality

03. Framework reveals significant performance differences among platforms

Abstract

Voice AI agents are rapidly transitioning to production deployments, yet systematic methods for ensuring testing reliability remain underdeveloped. Organizations cannot objectively assess whether their testing approaches (internal tools or external platforms) actually work, creating a critical measurement gap as voice AI scales to billions of daily interactions. We present the first systematic framework for evaluating voice AI testing quality through human-centered benchmarking. Our methodology addresses the fundamental dual challenge of testing platforms: generating realistic test conversations (simulation quality) and accurately evaluating agent responses (evaluation quality). The framework combines established psychometric techniques (pairwise comparisons yielding Elo ratings, bootstrap confidence intervals, and permutation tests) with rigorous statistical validation to provide…
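The abstract names three statistical building blocks (Elo ratings from pairwise comparisons, bootstrap confidence intervals, and permutation tests) without giving implementation details in this excerpt. The following is a minimal sketch of how such pieces are commonly assembled, not the authors' implementation. The function names, the K-factor of 32, the base rating of 1000, and the platform labels are all assumptions chosen for illustration.

```python
import random
from statistics import mean


def elo_ratings(comparisons, k=32.0, base=1000.0):
    """Sequential Elo update over (winner, loser) pairs.

    k and base are conventional defaults, assumed here; the result
    depends on the order in which comparisons are processed.
    """
    ratings = {}
    for winner, loser in comparisons:
        rw = ratings.setdefault(winner, base)
        rl = ratings.setdefault(loser, base)
        # Expected score of the winner under the logistic Elo model.
        expected_w = 1.0 / (1.0 + 10 ** ((rl - rw) / 400.0))
        delta = k * (1.0 - expected_w)
        ratings[winner] = rw + delta
        ratings[loser] = rl - delta
    return ratings


def bootstrap_ci(comparisons, item, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for one item's Elo rating."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_boot):
        # Resample the human judgments with replacement.
        resampled = [rng.choice(comparisons) for _ in comparisons]
        samples.append(elo_ratings(resampled).get(item, 1000.0))
    samples.sort()
    lower = samples[int((alpha / 2) * n_boot)]
    upper = samples[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def permutation_test(scores_a, scores_b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of means."""
    rng = random.Random(seed)
    observed = abs(mean(scores_a) - mean(scores_b))
    pooled = list(scores_a) + list(scores_b)
    n_a = len(scores_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical judgments: each tuple is (preferred, not preferred).
    judgments = [("platform_a", "platform_b")] * 7 + [("platform_b", "platform_a")] * 3
    print(elo_ratings(judgments))
    print(bootstrap_ci(judgments, "platform_a"))
    print(permutation_test([1, 1, 1, 0, 1], [0, 0, 1, 0, 0]))
```

Sequential Elo is order-sensitive, so a real analysis would likely shuffle the comparison order across repetitions or fit a Bradley-Terry model instead; that choice is not specified in this excerpt.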

Taxonomy

Topics: AI in Service Interactions · Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education