Semantic Distance: A New Metric for ASR Performance Analysis Towards Spoken Language Understanding
Suyoun Kim, Abhinav Arora, Duc Le, Ching-Feng Yeh, Christian Fuegen, Ozlem Kalinli, Michael L. Seltzer

TL;DR
This paper introduces Semantic Distance (SemDist), a new metric based on sentence embeddings to better evaluate ASR systems for downstream language understanding tasks, addressing limitations of Word Error Rate.
Contribution
We propose SemDist, a semantic-based evaluation metric for ASR that leverages RoBERTa embeddings to better reflect semantic correctness relevant to downstream tasks.
Findings
SemDist correlates better with downstream task performance than WER.
SemDist improves evaluation accuracy for intent recognition and semantic parsing.
The metric effectively captures semantic errors overlooked by WER.
Abstract
Word Error Rate (WER) has been the predominant metric used to evaluate the performance of automatic speech recognition (ASR) systems. However, WER is sometimes not a good indicator for downstream Natural Language Understanding (NLU) tasks, such as intent recognition, slot filling, and semantic parsing in task-oriented dialog systems. This is because WER takes into consideration only literal correctness instead of semantic correctness, the latter of which is typically more important for these downstream tasks. In this study, we propose a novel Semantic Distance (SemDist) measure as an alternative evaluation metric for ASR systems to address this issue. We define SemDist as the distance between a reference and hypothesis pair in a sentence-level embedding space. To represent the reference and hypothesis as a sentence embedding, we exploit RoBERTa, a state-of-the-art pre-trained deep…
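The contrast drawn in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the "distance" is cosine distance, and it takes sentence embeddings as plain lists of floats rather than calling RoBERTa. The function names `word_error_rate`, `mean_pool`, and `sem_dist` are hypothetical, and mean pooling over token vectors is only one common way to obtain a sentence embedding, shown here as an assumption.

```python
import math
from typing import List, Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def mean_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average token vectors into one sentence vector (assumed pooling)."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]


def sem_dist(ref_emb: Sequence[float], hyp_emb: Sequence[float]) -> float:
    """Cosine distance (1 - cosine similarity) between two embeddings."""
    dot = sum(a * b for a, b in zip(ref_emb, hyp_emb))
    norm = math.sqrt(sum(a * a for a in ref_emb)) * math.sqrt(
        sum(b * b for b in hyp_emb)
    )
    if norm == 0.0:
        raise ValueError("embeddings must be non-zero")
    return 1.0 - dot / norm


# One substituted word out of five reference words.
wer = word_error_rate("set an alarm for seven", "set an alarm for eleven")

# Hand-made vectors standing in for real sentence embeddings.
same_direction = sem_dist([1.0, 2.0], [2.0, 4.0])
orthogonal = sem_dist([1.0, 0.0], [0.0, 1.0])
```

WER treats every substituted word as equally costly, so the example above scores the same whether the wrong word changes the user's request or not. An embedding-based distance depends instead on how far apart the two sentences land in the embedding space, which is the property the abstract argues matters for downstream NLU tasks.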
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Linear Layer · Linear Warmup With Linear Decay · Residual Connection · Layer Normalization · Adam · Multi-Head Attention · Attention Dropout · Dense Connections · Softmax · Dropout
