YOLO-Stutter: End-to-end Region-Wise Speech Dysfluency Detection

Xuanru Zhou; Anshul Kashyap; Steve Li; Ayati Sharma; Brittany Morin; David Baquirin; Jet Vonk; Zoe Ezzes; Zachary Miller; Maria Luisa Gorno-Tempini; Jiachen Lian; Gopala Krishna Anumanchipalli

arXiv:2408.15297·eess.AS·September 17, 2024

TL;DR

YOLO-Stutter is an end-to-end model for region-wise, time-accurate detection of speech dysfluencies. It targets the efficiency and robustness limitations of rule-based systems and reports state-of-the-art results on both simulated and real aphasia speech datasets.

Contribution

The paper introduces YOLO-Stutter, presented as the first end-to-end approach for region-wise speech dysfluency detection, together with two new simulated dysfluency corpora, VCTK-Stutter and VCTK-TTS.

Findings

01. Achieves state-of-the-art accuracy on simulated and real aphasia speech.

02. Uses fewer trainable parameters than existing models.

03. Effectively detects various dysfluency types including repetition and prolongation.

Abstract

Dysfluent speech detection is the bottleneck for disordered speech analysis and spoken language learning. Current state-of-the-art models are governed by rule-based systems, which lack efficiency and robustness and are sensitive to template design. In this paper, we propose YOLO-Stutter, the first end-to-end method that detects dysfluencies in a time-accurate manner. YOLO-Stutter takes imperfect speech-text alignment as input, followed by a spatial feature aggregator and a temporal dependency extractor, to perform region-wise boundary and class predictions. We also introduce two dysfluency corpora, VCTK-Stutter and VCTK-TTS, that simulate natural spoken dysfluencies including repetition, block, missing, replacement, and prolongation. Our end-to-end method achieves state-of-the-art performance with a minimum number of trainable parameters on both simulated data and real aphasia speech.…
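
To make "region-wise boundary and class predictions" concrete, here is a minimal sketch of how one such prediction could be represented and compared against a reference annotation. Everything in it is an illustrative assumption rather than the authors' code: the `Region` type, the `temporal_iou` and `is_match` helpers, and the 0.5 overlap threshold are hypothetical, and the abstract does not state which evaluation metric the paper uses. Only the five dysfluency type names come from the abstract.

```python
from dataclasses import dataclass

# The five dysfluency types named in the abstract.
DYSFLUENCY_TYPES = ("repetition", "block", "missing", "replacement", "prolongation")


@dataclass(frozen=True)
class Region:
    """Hypothetical region-wise prediction: time bounds in seconds plus a class label."""
    start: float
    end: float
    label: str

    def __post_init__(self):
        if self.label not in DYSFLUENCY_TYPES:
            raise ValueError(f"unknown dysfluency type: {self.label!r}")
        if self.end < self.start:
            raise ValueError("region end must not precede its start")


def temporal_iou(a: Region, b: Region) -> float:
    """Intersection-over-union of two time intervals (0.0 when they do not overlap)."""
    intersection = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - intersection
    return intersection / union if union > 0 else 0.0


def is_match(pred: Region, truth: Region, iou_threshold: float = 0.5) -> bool:
    """Assumed matching rule: same class and sufficient temporal overlap."""
    return pred.label == truth.label and temporal_iou(pred, truth) >= iou_threshold


if __name__ == "__main__":
    predicted = Region(start=1.0, end=2.0, label="repetition")
    reference = Region(start=1.5, end=2.5, label="repetition")
    print(temporal_iou(predicted, reference))
    print(is_match(predicted, reference))
```

Treating each detection as a (start, end, class) triple mirrors the object-detection framing suggested by the model's name, with time intervals in place of bounding boxes; the threshold would need tuning to whatever tolerance a given evaluation requires.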

Code & Models

Repositories

rorizzz/yolo-stutter (PyTorch, official)

Taxonomy

Topics: Phonetics and Phonology Research · Speech Recognition and Synthesis · Voice and Speech Disorders