Evaluating User Perception of Speech Recognition System Quality with Semantic Distance Metric
Suyoun Kim, Duc Le, Weiyi Zheng, Tarun Singh, Abhinav Arora, Xiaoyu Zhai, Christian Fuegen, Ozlem Kalinli, Michael L. Seltzer

TL;DR
This paper introduces SemDist, a semantic distance metric for evaluating speech recognition quality, which correlates better with user perception and downstream NLU tasks than traditional WER.
Contribution
The paper proposes SemDist, a novel semantic correctness metric for ASR evaluation that outperforms WER in correlating with user perception and NLU performance.
Findings
SemDist correlates more strongly with user perception than WER.
SemDist shows higher correlation with downstream NLU tasks.
Experiments on two user-annotated sets of ASR outputs (71K and 36K) support these claims.
Abstract
Measuring automatic speech recognition (ASR) system quality is critical for creating user-satisfying voice-driven applications. Word Error Rate (WER) has traditionally been used to evaluate ASR system quality; however, it sometimes correlates poorly with user perception or judgement of transcription quality. This is because WER weighs every word equally and does not consider semantic correctness, which has a higher impact on user perception. In this work, we propose evaluating the quality of ASR output hypotheses with SemDist, which measures semantic correctness using the distance between the semantic vectors of the reference and the hypothesis extracted from a pre-trained language model. Our experimental results on 71K and 36K user-annotated ASR outputs show that SemDist achieves higher correlation with user perception than WER. We also show that SemDist has higher correlation with…
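The contrast drawn in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say which distance is used, so cosine distance is assumed here. The names `sem_dist`, `toy_embed`, and `TOY_VECTORS` are hypothetical, and the toy vectors are invented purely for illustration; in practice `embed` would wrap a pre-trained language model that returns a sentence vector.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1  # every word counts the same
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


def cosine_distance(u: Vector, v: Vector) -> float:
    """1 - cosine similarity. ASSUMPTION: the abstract only says 'distance'."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cannot compare a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def sem_dist(reference: str, hypothesis: str,
             embed: Callable[[str], Vector]) -> float:
    """Distance between the semantic vectors of reference and hypothesis."""
    return cosine_distance(embed(reference), embed(hypothesis))


# HYPOTHETICAL stand-in for a pre-trained language model. The values are
# made up so that "a"/"an" look alike and "seven"/"eleven" look different.
TOY_VECTORS = {
    "set": (1.0, 0.0, 0.0),
    "a": (0.0, 0.1, 0.0),
    "an": (0.0, 0.1, 0.0),
    "alarm": (0.0, 1.0, 0.0),
    "for": (0.1, 0.0, 0.0),
    "seven": (0.0, 0.0, 1.0),
    "eleven": (0.0, 0.0, -1.0),
}


def toy_embed(sentence: str) -> Vector:
    """Sum the toy word vectors; unknown words contribute nothing."""
    total = [0.0, 0.0, 0.0]
    for word in sentence.lower().split():
        for k, x in enumerate(TOY_VECTORS.get(word, (0.0, 0.0, 0.0))):
            total[k] += x
    return total


if __name__ == "__main__":
    ref = "set an alarm for seven"
    for hyp in ("set a alarm for seven", "set an alarm for eleven"):
        print(hyp, wer(ref, hyp), sem_dist(ref, hyp, toy_embed))
```

Both hypotheses contain one substituted word out of five, so WER scores them identically at 0.2. Under the toy embedding, the "a"/"an" error leaves the sentence vector unchanged, while the "seven"/"eleven" error moves it, which is the kind of difference in meaning that a semantic distance is meant to expose.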
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems
