Too Good to Be True: A Study on Modern Automatic Speech Recognition for the Evaluation of Speech Enhancement
Danilo de Oliveira, Tal Peer, Timo Gerkmann

TL;DR
This study examines how modern automatic speech recognition models correlate with human perception in evaluating speech enhancement, highlighting the influence of model complexity and training on evaluation reliability.
Contribution
The paper analyzes how well the transcriptions of advanced ASR models correlate with human recognition of enhanced speech, and it reveals limitations in using these models to evaluate speech enhancement.
Findings
Modern ASR models with large-scale noisy training and embedded language models correlate better with human WER than simpler models do.
Transducer models offer the most reliable transcriptions among tested models.
Robustness to noise and context in ASR models can be uninformative for acoustic-focused evaluation.
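As a rough illustration of what "correlating with human WER" can mean in practice, the sketch below computes a Pearson correlation between two lists of WER scores. The summary does not state which correlation measure the authors used, so Pearson is an assumption here, and the function name `pearson` and its inputs are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long score lists.

    Assumption: xs holds WER scores from an ASR model and ys holds WER
    scores from human transcriptions of the same items.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    # Sum of cross-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)

    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")

    return cov / sqrt(var_x * var_y)
```

A value near 1 would indicate that the ASR model ranks and scales errors much like human listeners do. A rank-based measure such as Spearman could be substituted if only the ordering matters.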
Abstract
Speech enhancement (SE) systems are typically evaluated using a variety of instrumental metrics. The use of automatic speech recognition (ASR) systems to evaluate SE performance is common in the literature, usually in terms of word error rate (WER). However, WER scores depend heavily on the choice of ASR system and text normalization pipeline. In this paper, we investigate how modern ASR models correlate with human recognition of enhanced speech. A listening experiment reveals that modern ASR models with large-scale noisy training and embedded language models correlate more with human WER than simpler ones, with a transducer model providing the most reliable transcriptions. Nevertheless, we also show that these models' robustness to noise and use of context can be uninformative to an acoustics-focused evaluation of enhancement performance.
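To make the dependence of WER on text normalization concrete, here is a minimal sketch of a word-level WER computation with a pluggable normalizer. It is not the authors' pipeline: the `normalize` function (lowercasing and punctuation stripping) and the name `word_error_rate` are assumptions chosen for illustration.

```python
import string


def normalize(text):
    """Hypothetical normalizer: lowercase, drop ASCII punctuation, split on whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()


def word_error_rate(reference, hypothesis, normalizer=normalize):
    """WER = word-level edit distance / number of reference words."""
    ref = normalizer(reference)
    hyp = normalizer(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


# Same transcript pair, scored with and without normalization.
with_norm = word_error_rate("The cat sat.", "the cat sat")
without_norm = word_error_rate("The cat sat.", "the cat sat", normalizer=str.split)
```

With the normalizer, the pair above scores as a perfect match. With plain whitespace splitting, casing and the trailing period count as two word errors out of three, which shows how much the reported number can shift with the normalization choice alone.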
