Predicting word error rate for reverberant speech
Hannes Gamper, Dimitra Emmanouilidou, Sebastian Braun, Ivan J. Tashev

TL;DR
This paper introduces methods to predict the word error rate (WER) of automatic speech recognition from acoustic parameters and, blindly, from reverberant speech samples. The blind CNN model outperforms C50-based prediction and needs no prior knowledge of the acoustic parameters.
Contribution
The paper proposes predicting WER directly from acoustic parameters via a polynomial, sigmoidal, or neural network fit, and blindly from reverberant speech samples using a convolutional neural network (CNN), as a way to quantify how reverberation degrades ASR performance.
Findings
C50 and C80 correlate strongly with WER
Polynomial, sigmoidal, and neural network fits on acoustic parameters allow WER to be predicted
The non-intrusive CNN model outperforms C50-based WER prediction
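
For readers unfamiliar with the clarity parameters named above, here is a minimal sketch of how C50 and C80 are commonly computed from an acoustic impulse response: the ratio, in dB, of the energy arriving in the first 50 ms (or 80 ms) after the direct sound to the energy arriving later. The function name `clarity_db`, the peak-based onset detection, and the synthetic impulse response are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np


def clarity_db(air, fs, early_ms=50.0):
    """Early-to-late energy ratio of an impulse response, in dB.

    early_ms=50 gives C50, early_ms=80 gives C80.
    """
    air = np.asarray(air, dtype=float)
    # Assumption: the direct sound is the strongest sample of the response.
    onset = int(np.argmax(np.abs(air)))
    split = onset + int(round(early_ms * 1e-3 * fs))
    energy = air ** 2
    early = energy[onset:split].sum()
    late = energy[split:].sum()
    # Note: undefined if the response has no energy after the split point.
    return 10.0 * np.log10(early / late)


# Synthetic impulse response (hypothetical, for illustration only):
# exponentially decaying noise with a reverberation time of about 0.5 s.
fs = 16000
rng = np.random.default_rng(0)
t = np.arange(int(0.8 * fs)) / fs
air = rng.standard_normal(t.size) * np.exp(-6.9 * t / 0.5)
air[0] = 5.0  # strong direct-path sample so the onset is at index 0

c50 = clarity_db(air, fs, early_ms=50.0)
c80 = clarity_db(air, fs, early_ms=80.0)
print(f"C50 = {c50:.1f} dB, C80 = {c80:.1f} dB")
```

Because the 80 ms window contains everything in the 50 ms window plus more, C80 is never smaller than C50 for the same response.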
Abstract
Reverberation negatively impacts the performance of automatic speech recognition (ASR). Prior work on quantifying the effect of reverberation has shown that clarity (C50), a parameter that can be estimated from the acoustic impulse response, is correlated with ASR performance. In this paper we propose predicting ASR performance in terms of the word error rate (WER) directly from acoustic parameters via a polynomial, sigmoidal, or neural network fit, as well as blindly from reverberant speech samples using a convolutional neural network (CNN). We carry out experiments on two state-of-the-art ASR models and a large set of acoustic impulse responses (AIRs). The results confirm C50 and C80 to be highly correlated with WER, allowing WER to be predicted with the proposed fitting approaches. The proposed non-intrusive CNN model outperforms C50-based WER prediction, indicating that WER can be…
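
To make the parameter-based prediction concrete, the sketch below fits a polynomial that maps C50 to WER, one of the three fit types named in the abstract. Everything numeric here is an assumption for illustration: the data are synthetic, the underlying curve shape, noise level, and polynomial degree are hypothetical, and none of it reproduces the paper's data or results.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic (C50, WER) pairs. Hypothetical trend: WER falls as clarity rises.
c50_db = rng.uniform(0.0, 30.0, size=200)
wer_trend = 10.0 + 50.0 / (1.0 + np.exp(0.3 * (c50_db - 10.0)))
wer_obs = wer_trend + rng.normal(0.0, 2.0, size=c50_db.size)

# Least-squares polynomial fit; degree 3 is an arbitrary choice for this sketch.
coeffs = np.polyfit(c50_db, wer_obs, deg=3)
predict_wer = np.poly1d(coeffs)

# Root-mean-square error of the fit on the same synthetic data.
rmse = float(np.sqrt(np.mean((predict_wer(c50_db) - wer_obs) ** 2)))
print(f"RMSE on synthetic data: {rmse:.2f} percentage points")
print(f"Predicted WER at C50 = 5 dB: {predict_wer(5.0):.1f} %")
```

A polynomial is easy to fit but can behave poorly outside the range of C50 values it was fitted on, which is one reason a bounded function such as a sigmoid may be preferred when extrapolation matters.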
