Augmenting Automatic Speech Recognition Models with Disfluency Detection
Robin Amann, Zhaolin Li, Barbara Bruno, Jan Niehues

TL;DR
This paper introduces an inference-only method to enhance ASR models with disfluency detection, accurately locating and classifying disfluencies in speech without retraining the models.
Contribution
It proposes a modified CTC-based forced alignment algorithm and a classification model for disfluency detection; both work with any ASR model without fine-tuning.
Findings
Captured 74.13% of missed disfluent words
Achieved 81.62% accuracy in gap classification
Demonstrated effective disfluency detection in spontaneous speech
Abstract
Speech disfluency commonly occurs in conversational and spontaneous speech. However, standard Automatic Speech Recognition (ASR) models struggle to accurately recognize these disfluencies because they are typically trained on fluent transcripts. Current research mainly focuses on detecting disfluencies within transcripts, overlooking their exact location and duration in the speech. Additionally, previous work often requires model fine-tuning and addresses limited types of disfluencies. In this work, we present an inference-only approach to augment any ASR model with the ability to detect open-set disfluencies. We first demonstrate that ASR models have difficulty transcribing speech disfluencies. Next, this work proposes a modified Connectionist Temporal Classification (CTC)-based forced alignment algorithm from Kürzinger et al. (2020) to predict word-level timestamps while effectively…
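For readers unfamiliar with CTC forced alignment, the sketch below shows the standard Viterbi-style procedure that such methods build on. It is not the paper's modified algorithm, whose details are not given in the text above. The function names `ctc_forced_align` and `token_spans`, the blank index 0, and the per-frame duration argument are assumptions made for illustration only.

```python
import math


def ctc_forced_align(log_probs, tokens, blank=0):
    """Standard CTC Viterbi alignment (illustrative, not the paper's variant).

    log_probs: list of T frames, each a list of per-label log-probabilities.
    tokens:    target label ids, in order.
    Returns a list of length T: the index into `tokens` each frame is
    aligned to, or None where the frame is aligned to blank.
    """
    T = len(log_probs)
    # Interleave blanks: [blank, t1, blank, t2, ..., blank].
    ext = [blank]
    for tok in tokens:
        ext += [tok, blank]
    S = len(ext)
    NEG = -math.inf

    dp = [[NEG] * S for _ in range(T)]    # best log-score ending in state s at frame t
    back = [[0] * S for _ in range(T)]    # predecessor state for backtracking
    dp[0][0] = log_probs[0][blank]
    if S > 1:
        dp[0][1] = log_probs[0][ext[1]]

    for t in range(1, T):
        for s in range(S):
            candidates = [(dp[t - 1][s], s)]              # stay in the same state
            if s >= 1:
                candidates.append((dp[t - 1][s - 1], s - 1))  # advance by one
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append((dp[t - 1][s - 2], s - 2))
            best, prev = max(candidates)
            if best > NEG:
                dp[t][s] = best + log_probs[t][ext[s]]
                back[t][s] = prev

    # A valid path ends in the final blank or the final label.
    s = S - 1
    if S > 1 and dp[T - 1][S - 2] > dp[T - 1][S - 1]:
        s = S - 2
    if dp[T - 1][s] == NEG:
        raise ValueError("too few frames to align the given tokens")

    path = [None] * T
    for t in range(T - 1, -1, -1):
        path[t] = None if ext[s] == blank else (s - 1) // 2
        s = back[t][s]
    return path


def token_spans(path, frame_sec):
    """Turn a frame-level path into (start, end) times per aligned token."""
    spans = {}
    for t, idx in enumerate(path):
        if idx is None:
            continue
        start, _ = spans.get(idx, (t * frame_sec, None))
        spans[idx] = (start, (t + 1) * frame_sec)
    return [spans[i] for i in sorted(spans)]
```

Stretches of frames aligned to blank between two token spans are the kind of untranscribed region a downstream classifier could inspect, though how the paper defines and classifies its gaps is not specified here.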
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
