A Paradigm for Interpreting Metrics and Identifying Critical Errors in Automatic Speech Recognition
Thibault Bañeras-Roux, Mickael Rouvier, Jane Wottawa, Richard Dufour

TL;DR
This paper proposes a new paradigm that integrates existing metrics into a Minimum Edit Distance framework to better interpret errors in automatic speech recognition by aligning them with human perception.
Contribution
It introduces a novel approach that combines traditional metrics with human perception modeling to improve error interpretation in speech recognition evaluation.
Findings
The proposed paradigm aligns error severity with human perception.
It offers a more interpretable error measure than traditional WER and CER.
The approach facilitates studying the severity of transcription errors from a human perspective.
Abstract
The most commonly used metrics for evaluating automatic speech transcriptions, namely Word Error Rate (WER) and Character Error Rate (CER), have been heavily criticized for their poor correlation with human perception and their inability to take linguistic and semantic information into account. While embedding-based metrics that seek to approximate human perception have been proposed, their scores remain difficult to interpret, unlike WER and CER. In this article, we address this problem by proposing a paradigm in which a chosen metric is incorporated to obtain an equivalent of the error rate: a Minimum Edit Distance (minED). This approach relates transcription errors to their human perception and also enables an original study of the severity of these errors from a human perspective.
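For readers unfamiliar with the baselines the abstract contrasts against, the following minimal sketch shows how WER and CER are conventionally computed: a Levenshtein edit distance over words (or characters), divided by the reference length. The function names are illustrative assumptions and do not come from the paper, and the sketch does not implement the paper's minED.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Substitutions, insertions, and deletions each cost 1.
    """
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of r
                curr[j - 1] + 1,     # insertion of h
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sit on mat"
    print(f"WER = {wer(ref, hyp):.3f}")
    print(f"CER = {cer(ref, hyp):.3f}")
```

Both rates weight every edit equally, whatever its effect on meaning, which is the limitation the abstract describes.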
