Time and Tokens: Benchmarking End-to-End Speech Dysfluency Detection
Xuanru Zhou, Jiachen Lian, Cheol Jun Cho, Jingwen Liu, Zongli Ye, Jinming Zhang, Brittany Morin, David Baquirin, Jet Vonk, Zoe Ezzes, Zachary Miller, Maria Luisa Gorno Tempini, Gopala Anumanchipalli

TL;DR
This paper introduces a novel token-based approach to speech dysfluency detection, framing it as a sequence-to-sequence recognition task, and provides a new benchmark with open-source tools for future research.
Contribution
It proposes a tokenization-based detection method, develops simulators and a new benchmark, and systematically compares token-based and time-based approaches.
Findings
Token-based methods outperform time-based methods in certain scenarios.
The new benchmark facilitates standardized evaluation of dysfluency detection.
Open-source resources support future research in speech dysfluency modeling.
Abstract
Speech dysfluency modeling is a task to detect dysfluencies in speech, such as repetition, block, insertion, replacement, and deletion. Most recent advancements treat this problem as a time-based object detection problem. In this work, we revisit this problem from a new perspective: tokenizing dysfluencies and modeling the detection problem as a token-based automatic speech recognition (ASR) problem. We propose rule-based speech and text dysfluency simulators and develop VCTK-token, and then develop a Whisper-like seq2seq architecture to build a new benchmark with decent performance. We also systematically compare our proposed token-based methods with time-based methods, and propose a unified benchmark to facilitate future research endeavors. We open-source these resources for the broader scientific community. The project page is available at https://rorizzz.github.io/
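To make the token-based framing concrete, here is a minimal sketch of a rule-based text dysfluency simulator. It is an illustration only, not the authors' simulator: the token names (`[REP]`, `[INS]`, `[DEL]`), the filler words, and the function `simulate_dysfluency` are all hypothetical, and only three of the five dysfluency types named in the abstract are covered. The idea it shows is that a dysfluency can be written into the target transcript as a special token, so a seq2seq recognizer can be trained to emit it alongside ordinary words.

```python
import random

# Hypothetical token names; the paper's actual vocabulary may differ.
REP, INS, DEL = "[REP]", "[INS]", "[DEL]"
# Hypothetical filler words used for simulated insertions.
FILLERS = ("uh", "um")


def simulate_dysfluency(words, kind, index, rng=None):
    """Return a copy of `words` with one dysfluency of `kind` injected at `index`.

    The result is a token-annotated transcript: ordinary words plus one
    special token marking where the dysfluency occurs.
    """
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    out = list(words)
    if kind == "repetition":
        # Say the word twice, with a token marking the repetition.
        out[index:index + 1] = [words[index], REP, words[index]]
    elif kind == "insertion":
        # Insert a filler word followed by a token marking it as inserted.
        out[index:index] = [rng.choice(FILLERS), INS]
    elif kind == "deletion":
        # Drop the word and leave a token in its place.
        out[index:index + 1] = [DEL]
    else:
        raise ValueError(f"unsupported dysfluency kind: {kind}")
    return out


if __name__ == "__main__":
    sentence = ["please", "call", "stella"]
    for kind in ("repetition", "insertion", "deletion"):
        print(kind, simulate_dysfluency(sentence, kind, 1))
```

A time-based system would instead predict a start time, an end time, and a type for each dysfluent region; in the token-based view sketched here, the location is implied by the token's position in the output sequence.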
Taxonomy
Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Stuttering Research and Treatment
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Sequence to Sequence
