Time and Tokens: Benchmarking End-to-End Speech Dysfluency Detection

Xuanru Zhou; Jiachen Lian; Cheol Jun Cho; Jingwen Liu; Zongli Ye; Jinming Zhang; Brittany Morin; David Baquirin; Jet Vonk; Zoe Ezzes; Zachary Miller; Maria Luisa Gorno Tempini; Gopala Anumanchipalli

arXiv:2409.13582 · eess.AS · September 23, 2024

TL;DR

This paper introduces a novel token-based approach to speech dysfluency detection, framing it as a sequence-to-sequence recognition task, and provides a new benchmark with open-source tools for future research.

Contribution

It proposes a tokenization-based detection method, develops simulators and a new benchmark, and systematically compares token-based and time-based approaches.

Findings

1. Token-based methods outperform time-based methods in certain scenarios.

2. The new benchmark facilitates standardized evaluation of dysfluency detection.

3. Open-source resources support future research in speech dysfluency modeling.

Abstract

Speech dysfluency modeling is the task of detecting dysfluencies in speech, such as repetition, block, insertion, replacement, and deletion. Most recent advancements treat this problem as a time-based object detection problem. In this work, we revisit this problem from a new perspective: tokenizing dysfluencies and modeling the detection problem as a token-based automatic speech recognition (ASR) problem. We propose rule-based speech and text dysfluency simulators and develop VCTK-token, and then develop a Whisper-like seq2seq architecture to build a new benchmark with decent performance. We also systematically compare our proposed token-based methods with time-based methods, and propose a unified benchmark to facilitate future research endeavors. We open-source these resources for the broader scientific community. The project page is available at https://rorizzz.github.io/.
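To make the token-based framing in the abstract concrete, here is a minimal Python sketch of a rule-based text dysfluency simulator. It is an illustration under stated assumptions, not the authors' implementation: the token strings (`<REP>`, `<BLOCK>`, `<INS>`, `<REPL>`, `<DEL>`), the `(pause)` marker, the function name `inject_dysfluency`, and the choice to place the token right after the affected word are all hypothetical. The function returns the simulated dysfluent word sequence together with the target sequence a seq2seq recognizer could be trained to emit.

```python
from typing import List, Tuple

# Hypothetical token inventory covering the five dysfluency types named in
# the abstract; the vocabulary actually used in the paper may differ.
TOKENS = {
    "repetition": "<REP>",
    "block": "<BLOCK>",
    "insertion": "<INS>",
    "replacement": "<REPL>",
    "deletion": "<DEL>",
}


def inject_dysfluency(
    words: List[str],
    index: int,
    kind: str,
    filler: str = "uh",
    substitute: str = "thing",
) -> Tuple[List[str], List[str]]:
    """Apply one rule-based dysfluency to words[index].

    Returns (spoken, target):
      spoken - the simulated dysfluent word sequence
      target - the reference words with a dysfluency token placed
               immediately after the affected word
    """
    if kind not in TOKENS:
        raise ValueError(f"unknown dysfluency type: {kind}")
    if not 0 <= index < len(words):
        raise IndexError("index out of range")

    word = words[index]
    if kind == "repetition":
        spoken_part = [word, word]        # the word is said twice
    elif kind == "block":
        spoken_part = ["(pause)", word]   # assumed marker for a silent block
    elif kind == "insertion":
        spoken_part = [filler, word]      # a filler precedes the word
    elif kind == "replacement":
        spoken_part = [substitute]        # a different word is said instead
    else:                                 # deletion
        spoken_part = []                  # the word is dropped

    spoken = words[:index] + spoken_part + words[index + 1:]
    target = words[:index] + [word, TOKENS[kind]] + words[index + 1:]
    return spoken, target


if __name__ == "__main__":
    reference = "please call stella".split()
    for kind in TOKENS:
        spoken, target = inject_dysfluency(reference, 1, kind)
        print(kind, "|", " ".join(spoken), "|", " ".join(target))
```

Under this framing, detection reduces to predicting the target sequence: the position of each special token marks where the dysfluency occurs relative to the reference words, with no timestamps involved.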

Taxonomy

Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Stuttering Research and Treatment

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Sequence to Sequence