Inclusive ASR for Disfluent Speech: Cascaded Large-Scale Self-Supervised Learning with Targeted Fine-Tuning and Data Augmentation

Dena Mujtaba; Nihar R. Mahapatra; Megan Arney; J. Scott Yaruss; Caryn Herring; Jia Bin

arXiv:2406.10177·eess.AS·October 3, 2024

Open Access

TL;DR

This paper introduces an inclusive ASR approach that combines large-scale self-supervised learning, targeted fine-tuning, and data augmentation to improve recognition of disfluent speech, especially for people who stutter.

Contribution

The paper presents a method that combines self-supervised learning on standard speech with targeted fine-tuning and data augmentation on a smaller, curated disfluent speech dataset to improve ASR performance on disfluent speech.

Findings

01. Significant reduction in word error rates for disfluent speech.

02. Effective data augmentation techniques for disfluency diversity.

03. Improved ASR inclusivity for speech with stuttering.
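
The findings above are stated in terms of word error rate (WER). As a minimal sketch of how that metric is conventionally computed (word-level edit distance divided by the number of reference words), the function below may help. The name `word_error_rate` and the lowercase, whitespace-split normalization are assumptions made for illustration; this listing does not say how the paper normalizes transcripts or scores disfluencies.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: both strings are lowercased and split on whitespace.
    Real evaluations often apply more text normalization than this.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("i want to go home", "i i want to to go home")` counts the two repeated words as insertions against a five-word reference. This is why a transcript that reproduces repetitions verbatim can score poorly against a fluent reference.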

Abstract

Automatic speech recognition (ASR) systems often falter while processing stuttering-related disfluencies -- such as involuntary blocks and word repetitions -- yielding inaccurate transcripts. A critical barrier to progress is the scarcity of large, annotated disfluent speech datasets. Therefore, we present an inclusive ASR design approach, leveraging large-scale self-supervised learning on standard speech followed by targeted fine-tuning and data augmentation on a smaller, curated dataset of disfluent speech. Our data augmentation technique enriches training datasets with various disfluencies, enhancing ASR processing of these speech patterns. Results show that fine-tuning wav2vec 2.0 with even a relatively small, labeled dataset, alongside data augmentation, can significantly reduce word error rates for disfluent speech. Our approach not only advances ASR inclusivity for people who…
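
The abstract describes enriching training data with various disfluencies but, as shown here, gives no details of the procedure. The sketch below is therefore not the authors' technique. It is a toy, text-level illustration of the general idea, inserting word repetitions and interjections into a fluent transcript. The function name `add_disfluencies`, the `FILLERS` inventory, and the `rate` parameter are all hypothetical. An augmentation pipeline for ASR training would also need to modify the audio to match, which this sketch does not attempt.

```python
import random

# Hypothetical interjection inventory, chosen only for illustration.
FILLERS = ("um", "uh")


def add_disfluencies(transcript: str, rate: float = 0.2, seed=None) -> str:
    """Insert a repetition or an interjection before some words.

    Assumption: each word independently receives one inserted token
    with probability `rate`. The token is either a copy of the word
    (a word repetition) or a filler drawn from FILLERS.
    """
    rng = random.Random(seed)  # local RNG so results are reproducible per seed
    augmented = []
    for word in transcript.split():
        if rng.random() < rate:
            if rng.choice(("repetition", "interjection")) == "repetition":
                augmented.append(word)
            else:
                augmented.append(rng.choice(FILLERS))
        augmented.append(word)
    return " ".join(augmented)
```

Calling `add_disfluencies("please call me back", rate=0.5, seed=0)` returns a variant of the sentence with extra tokens. Passing the same seed reproduces the same variant, which helps when regenerating an augmented training set.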

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing