TL;DR
This paper introduces e-WER2, a multistream end-to-end approach for estimating the word error rate (WER) of a speech recognizer without reference transcriptions. In its no-box configuration it also requires no access to the ASR system, so performance can be evaluated without manual transcription effort.
Contribution
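The paper presents a no-box WER estimation method, reported alongside glass-box and black-box settings. A multistream architecture learns a joint acoustic-lexical representation from phoneme recognition results and MFCC acoustic features, so WER can be estimated even when the ASR system itself is not accessible.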
Findings
The no-box system's per-sentence WER estimates achieve a Pearson correlation of 0.56 with the reference WER.
The estimated per-sentence WER has a root mean square error (RMSE) of 0.24 across 1,400 sentences.
e-WER2 estimates WER with reasonable accuracy without transcriptions.
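The two metrics above compare per-sentence estimated WER with per-sentence reference WER. The following is a minimal sketch of how they can be computed, not the authors' evaluation code; the function names `pearson` and `rmse` and the sample values are hypothetical and are not taken from the paper.

```python
import math


def pearson(estimated, reference):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(estimated)
    mean_e = sum(estimated) / n
    mean_r = sum(reference) / n
    # Covariance and variances, all left unnormalized (the 1/n factors cancel).
    cov = sum((e - mean_e) * (r - mean_r) for e, r in zip(estimated, reference))
    var_e = sum((e - mean_e) ** 2 for e in estimated)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    return cov / math.sqrt(var_e * var_r)


def rmse(estimated, reference):
    """Root mean square error between two equal-length lists of numbers."""
    squared = [(e - r) ** 2 for e, r in zip(estimated, reference)]
    return math.sqrt(sum(squared) / len(squared))


# Illustrative per-sentence WER values (made up, not results from the paper).
estimated_wer = [0.10, 0.35, 0.50, 0.20]
reference_wer = [0.05, 0.40, 0.60, 0.10]
print(pearson(estimated_wer, reference_wer))
print(rmse(estimated_wer, reference_wer))
```

Correlation measures whether the estimator ranks sentences in the same order as the reference, while RMSE measures how far the estimates are from the reference in absolute terms, so the two numbers are complementary.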
Abstract
Measuring the performance of automatic speech recognition (ASR) systems requires manually transcribed data in order to compute the word error rate (WER), which is often time-consuming and expensive. In this paper, we continue our effort in estimating WER using acoustic, lexical and phonotactic features. Our novel approach to estimate the WER uses a multistream end-to-end architecture. We report results for systems using internal speech decoder features (glass-box), systems without speech decoder features (black-box), and for systems without having access to the ASR system (no-box). The no-box system learns joint acoustic-lexical representation from phoneme recognition results along with MFCC acoustic features to estimate WER. Considering WER per sentence, our no-box system achieves 0.56 Pearson correlation with the reference evaluation and 0.24 root mean square error (RMSE) across 1,400…
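For context, the reference WER that e-WER2 tries to predict is conventionally computed as the word-level edit distance between a manual transcript and the ASR output, divided by the number of reference words. The sketch below shows that standard computation and why a manual transcript is needed for it; the function name `wer` and the example sentences are hypothetical, and whitespace tokenization is an assumption rather than the paper's text normalization.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()  # assumption: simple whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of six reference words.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.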
