Improving Distinction between ASR Errors and Speech Disfluencies with Feature Space Interpolation
Seongmin Park, Dongchan Shin, Sangyoun Paik, Subong Choi, Alena Kazakova, Jihwa Lee

TL;DR
This paper introduces a feature space interpolation method to enhance ASR error detection by reducing confusion caused by speech disfluencies, improving detection accuracy across multiple languages and systems.
Contribution
It proposes a novel mixup-based approach in feature space to improve error detection and robustness of language models against disfluencies in ASR post-processing.
Findings
Improves ASR error detection F1 scores
Reduces false positives on disfluencies
Effective across multiple languages and ASR systems
Abstract
Fine-tuning pretrained language models (LMs) is a popular approach to automatic speech recognition (ASR) error detection during post-processing. While error detection systems often take advantage of statistical language archetypes captured by LMs, at times the pretrained knowledge can hinder error detection performance. For instance, presence of speech disfluencies might confuse the post-processing system into tagging disfluent but accurate transcriptions as ASR errors. Such confusion occurs because both error detection and disfluency detection tasks attempt to identify tokens at statistically unlikely positions. This paper proposes a scheme to improve existing LM-based ASR error detection systems, both in terms of detection scores and resilience to such distracting auxiliary tasks. Our approach adopts the popular mixup method in text feature space and can be utilized with any black-box…
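For readers unfamiliar with the mixup method the abstract refers to, the following is a minimal sketch of generic mixup applied to feature vectors. It is not the authors' scheme, whose details are cut off above. It assumes the standard formulation: a mixing weight drawn from a Beta(alpha, alpha) distribution and a convex combination of two feature vectors and their label vectors. The function name `mixup_features`, the toy vectors, and the value `alpha=0.2` are hypothetical and chosen only for illustration.

```python
import random


def mixup_features(feat_a, feat_b, label_a, label_b, alpha=0.2, rng=None):
    """Generic mixup: convex combination of two feature/label pairs.

    Assumption: this is the standard mixup recipe, not the paper's
    specific interpolation scheme.
    """
    rng = rng or random.Random()
    # Mixing weight in [0, 1], drawn from Beta(alpha, alpha).
    lam = rng.betavariate(alpha, alpha)
    # Interpolate features and labels element-wise with the same weight.
    mixed_feat = [lam * a + (1.0 - lam) * b for a, b in zip(feat_a, feat_b)]
    mixed_label = [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]
    return mixed_feat, mixed_label, lam


# Toy example (hypothetical values): a token feature labeled "ASR error"
# mixed with a token feature labeled "correct transcription".
feat_err = [0.9, -0.3, 0.5, 0.1]
feat_ok = [0.1, 0.4, -0.2, 0.7]
label_err = [1.0, 0.0]  # one-hot: [error, correct]
label_ok = [0.0, 1.0]

mixed_feat, mixed_label, lam = mixup_features(
    feat_err, feat_ok, label_err, label_ok, alpha=0.2, rng=random.Random(0)
)
```

Because the same weight is applied to features and labels, the mixed label remains a valid soft label that sums to 1, and each mixed feature lies between the two original values.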
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
Methods: Mixup
