Automatic Speech Recognition System-Independent Word Error Rate Estimation
Chanho Park, Mingjie Chen, Thomas Hain

TL;DR
This paper introduces a hypothesis generation method for estimating Word Error Rate (WER) that does not depend on a specific ASR system. The estimator is trained on simulated ASR output, which makes it more robust across domains.
Contribution
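The paper proposes an ASR System-Independent WER Estimation (SIWE) approach in which the estimator is trained on simulated ASR output rather than on the output of one particular ASR system. The resulting estimator outperforms baseline estimators on out-of-domain data.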
Findings
Achieves state-of-the-art performance on out-of-domain datasets.
Outperforms baseline estimators in RMSE and Pearson correlation.
Performance improves when the WER of the training data matches the WER of the evaluation data.
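The two metrics named in the findings can be computed as in the minimal sketch below. It assumes the estimator outputs one WER value per utterance and that a reference WER is available for each; the function names and the list-based interface are illustrative, not taken from the paper.

```python
import math


def rmse(predicted, reference):
    """Root mean square error between estimated and reference WER values."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


def pearson(predicted, reference):
    """Pearson correlation coefficient between estimated and reference WER values."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    if var_p == 0 or var_r == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_r)
```

A lower RMSE means the estimates are closer to the reference WER in absolute terms, while a higher Pearson correlation means the estimates rank utterances more consistently with the reference, even if they are offset or scaled.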
Abstract
Word error rate (WER) is a metric used to evaluate the quality of transcriptions produced by Automatic Speech Recognition (ASR) systems. In many applications, it is of interest to estimate WER given a pair of a speech utterance and a transcript. Previous work on WER estimation focused on building models that are trained with a specific ASR system in mind (referred to as ASR system-dependent). These are also domain-dependent and inflexible in real-world applications. In this paper, a hypothesis generation method for ASR System-Independent WER estimation (SIWE) is proposed. In contrast to prior work, the WER estimators are trained using data that simulates ASR system output. Hypotheses are generated using phonetically similar or linguistically more likely alternative words. In WER estimation experiments, the proposed method reaches a similar performance to ASR system-dependent WER…
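As a rough illustration of the quantities the abstract refers to, the sketch below computes WER as word-level edit distance divided by reference length, and generates a toy hypothesis by substituting words from a lookup table. The `alternatives` table, the `sub_prob` parameter, and the function names are assumptions made for this example; the paper's actual method for choosing phonetically similar or linguistically more likely words is not reproduced here.

```python
import random


def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i]
        for j, h in enumerate(hyp_words, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """WER = edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def simulate_hypothesis(reference, alternatives, sub_prob, rng):
    """Toy stand-in for hypothesis generation (hypothetical helper).

    Each word that has entries in `alternatives` is replaced with
    probability `sub_prob` by one of its listed alternatives.
    """
    out = []
    for word in reference.split():
        options = alternatives.get(word)
        if options and rng.random() < sub_prob:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)


# Usage: the alternatives table is hand-written here purely for illustration.
rng = random.Random(0)
reference = "the cat sat"
hypothesis = simulate_hypothesis(reference, {"cat": ["hat"]}, 1.0, rng)
simulated_wer = wer(reference, hypothesis)
```

Pairs of speech utterances and simulated hypotheses, labelled with the WER computed this way, are the kind of training data the abstract describes for an estimator that never sees the output of a specific ASR system.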
Taxonomy
Topics: Speech Recognition and Synthesis
Methods: Sparse Evolutionary Training
