Testing the Testers: Human-Driven Quality Assessment of Voice AI Testing Platforms
Miguel E. Andres, Vadim Fedorov, Rida Sadek, Enric Spagnolo-Arrizabalaga, Nadescha Trudel

TL;DR
This paper introduces a human-centered benchmarking framework for evaluating voice AI testing platforms, addressing the critical need for objective measurement of testing quality as voice AI scales globally.
Contribution
It presents the first systematic, statistically validated framework for assessing voice AI testing platforms' simulation and evaluation quality using human judgments.
Findings
Evalion outperforms competitors with 0.92 evaluation quality
Top platform achieves 0.61 simulation quality
Framework reveals significant performance differences among platforms
Abstract
Voice AI agents are rapidly transitioning to production deployments, yet systematic methods for ensuring testing reliability remain underdeveloped. Organizations cannot objectively assess whether their testing approaches (internal tools or external platforms) actually work, creating a critical measurement gap as voice AI scales to billions of daily interactions. We present the first systematic framework for evaluating voice AI testing quality through human-centered benchmarking. Our methodology addresses the fundamental dual challenge of testing platforms: generating realistic test conversations (simulation quality) and accurately evaluating agent responses (evaluation quality). The framework combines established psychometric techniques (pairwise comparisons yielding Elo ratings, bootstrap confidence intervals, and permutation tests) with rigorous statistical validation to provide…
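The three statistical tools the abstract names can be sketched briefly. This is a minimal illustration, not the authors' implementation: the function names, the K-factor of 32, the base rating of 1500, and the platform labels "A" and "B" are all assumptions chosen for the example, and the judgments are hypothetical.

```python
import random


def elo_ratings(comparisons, k=32.0, base=1500.0):
    """Sequential Elo update over (winner, loser) pairs from human judgments.

    k and base are conventional defaults assumed here, not values from the paper.
    """
    ratings = {}
    for winner, loser in comparisons:
        rw = ratings.setdefault(winner, base)
        rl = ratings.setdefault(loser, base)
        # Expected score of the winner under the logistic Elo model.
        expected_w = 1.0 / (1.0 + 10 ** ((rl - rw) / 400.0))
        delta = k * (1.0 - expected_w)
        ratings[winner] = rw + delta
        ratings[loser] = rl - delta
    return ratings


def bootstrap_ci(comparisons, platform, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for one platform's Elo rating."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_boot):
        # Resample judgments with replacement, then recompute ratings.
        resample = [rng.choice(comparisons) for _ in comparisons]
        samples.append(elo_ratings(resample).get(platform, 1500.0))
    samples.sort()
    lower = samples[int((alpha / 2) * n_boot)]
    upper = samples[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def permutation_test(scores_a, scores_b, n_perm=5000, seed=0):
    """Two-sided permutation test on the difference in mean scores."""
    rng = random.Random(seed)

    def mean(xs):
        return sum(xs) / len(xs)

    observed = abs(mean(scores_a) - mean(scores_b))
    pooled = list(scores_a) + list(scores_b)
    n_a = len(scores_a)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(mean(pooled[:n_a]) - mean(pooled[n_a:])) >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical judgments: platform "A" preferred over "B" in 7 of 10 pairs.
    judgments = [("A", "B")] * 7 + [("B", "A")] * 3
    print(elo_ratings(judgments))
    print(bootstrap_ci(judgments, "A"))
    print(permutation_test([1, 1, 1, 0, 1], [0, 0, 1, 0, 0]))
```

Sequential Elo depends on the order in which comparisons are processed, so a real analysis might average over shuffled orderings or fit a Bradley-Terry model instead. The bootstrap here resamples individual judgments, which assumes they are independent.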
Taxonomy
Topics: AI in Service Interactions · Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education
