Augmenting Automatic Speech Recognition Models with Disfluency Detection

Robin Amann; Zhaolin Li; Barbara Bruno; Jan Niehues

arXiv:2409.10177·cs.CL·September 18, 2024

Open Access

TL;DR

This paper introduces an inference-only method to enhance ASR models with disfluency detection, accurately locating and classifying disfluencies in speech without retraining the models.

Contribution

It proposes a novel CTC-based alignment algorithm and a classification model for disfluency detection that works with any ASR model without fine-tuning.

Findings

01. Captured 74.13% of missed disfluent words

02. Achieved 81.62% accuracy in gap classification

03. Demonstrated effective disfluency detection in spontaneous speech

Abstract

Speech disfluency commonly occurs in conversational and spontaneous speech. However, standard Automatic Speech Recognition (ASR) models struggle to accurately recognize these disfluencies because they are typically trained on fluent transcripts. Current research mainly focuses on detecting disfluencies within transcripts, overlooking their exact location and duration in the speech. Additionally, previous work often requires model fine-tuning and addresses limited types of disfluencies. In this work, we present an inference-only approach to augment any ASR model with the ability to detect open-set disfluencies. We first demonstrate that ASR models have difficulty transcribing speech disfluencies. Next, this work proposes a modified Connectionist Temporal Classification (CTC)-based forced alignment algorithm, adapted from Kürzinger et al. (2020), to predict word-level timestamps while effectively…
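
For orientation, the kind of CTC forced alignment the abstract builds on can be sketched as a Viterbi search over the blank-extended token sequence. This is the textbook procedure only, not the modified algorithm the paper proposes, and it leaves out the gap classification step. The function name `ctc_forced_align`, the blank index 0, and the list-of-lists input format are assumptions made for this illustration.

```python
import math

NEG_INF = -math.inf


def ctc_forced_align(log_probs, tokens, blank=0):
    """Viterbi-align `tokens` to frames (hypothetical helper, textbook CTC).

    log_probs: T rows of per-frame log-probabilities over the vocabulary.
    tokens:    non-empty list of target token ids (no blanks).
    Returns (path, spans): the best state index per frame, and one
    (first_frame, last_frame) pair per token.
    """
    if not tokens:
        raise ValueError("tokens must be non-empty")

    # Interleave blanks: [blank, t1, blank, t2, ..., blank]
    ext = [blank]
    for tok in tokens:
        ext += [tok, blank]
    T, S = len(log_probs), len(ext)

    score = [[NEG_INF] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    score[0][0] = log_probs[0][blank]
    score[0][1] = log_probs[0][ext[1]]

    for t in range(1, T):
        for s in range(S):
            # Allowed predecessors: stay, advance by one, or skip a blank
            # between two different non-blank tokens.
            cands = [(score[t - 1][s], s)]
            if s >= 1:
                cands.append((score[t - 1][s - 1], s - 1))
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append((score[t - 1][s - 2], s - 2))
            best, prev = max(cands)
            if best > NEG_INF:
                score[t][s] = best + log_probs[t][ext[s]]
                back[t][s] = prev

    # A valid path ends in the last token or the trailing blank.
    s = S - 1 if score[T - 1][S - 1] >= score[T - 1][S - 2] else S - 2
    if score[T - 1][s] == NEG_INF:
        raise ValueError("too few frames to align all tokens")

    # Trace the best path backwards through the stored predecessors.
    path = [0] * T
    for t in range(T - 1, -1, -1):
        path[t] = s
        s = back[t][s]

    # Token i occupies extended state 2*i + 1.
    spans = []
    for i in range(len(tokens)):
        frames = [t for t, state in enumerate(path) if state == 2 * i + 1]
        spans.append((frames[0], frames[-1]))
    return path, spans
```

Multiplying the frame indices in `spans` by the acoustic model's frame stride (a model-dependent value) would give timestamps in seconds. Frames assigned to blank states between two tokens are the unaligned stretches in which a gap-based detector could look for disfluencies.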

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing