Back to Basics: Revisiting ASR in the Age of Voice Agents

Geeyang Tay; Wentao Ma; Jaewon Lee; Yuzhi Tang; Daniel Lee; Weisu Yin; Dongming Shen; Silin Meng; Yi Zhu; Mu Li; Alex Smola

arXiv:2603.25727 · cs.AI · March 27, 2026

Open Access

TL;DR

This paper introduces WildASR, a diagnostic benchmark for evaluating multilingual ASR systems under real-world conditions, revealing significant robustness issues and safety risks in current models.

Contribution

The paper presents WildASR, a new multilingual benchmark that isolates the factors affecting ASR robustness, along with analytical tools to guide deployment decisions.

Findings

01. Severe and uneven performance degradation across languages and conditions.

02. Model robustness does not transfer across languages or conditions.

03. Models often hallucinate plausible but unspoken content under partial or degraded inputs, creating safety risks.

Abstract

Automatic speech recognition (ASR) systems have achieved near-human accuracy on curated benchmarks, yet still fail in real-world voice agents under conditions that current evaluations do not systematically cover. Without diagnostic tools that isolate specific failure factors, practitioners cannot anticipate which conditions, in which languages, will cause what degree of degradation. We introduce WildASR, a multilingual (four-language) diagnostic benchmark sourced entirely from real human speech that factorizes ASR robustness along three axes: environmental degradation, demographic shift, and linguistic diversity. Evaluating seven widely used ASR systems, we find severe and uneven performance degradation, and model robustness does not transfer across languages or conditions. Critically, models often hallucinate plausible but unspoken content under partial or degraded inputs, creating…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · AI in Service Interactions