SSDM 2.0: Time-Accurate Speech Rich Transcription with Non-Fluencies
Jiachen Lian, Xuanru Zhou, Zoe Ezzes, Jet Vonk, Brittany Morin, David Baquirin, Zachary Miller, Maria Luisa Gorno Tempini, Gopala Krishna Anumanchipalli

TL;DR
SSDM 2.0 advances speech transcription by effectively capturing non-fluencies and dysfluencies using novel neural representations, a full-stack aligner, and in-context learning with large language models, outperforming previous models.
Contribution
The paper introduces SSDM 2.0 with a neural articulatory flow, a comprehensive subsequence aligner, in-context learning modules, and a large dysfluency corpus, addressing previous limitations in dysfluency transcription.
Findings
Outperforms previous dysfluency transcription models significantly.
Effectively captures all types of dysfluencies in speech.
Demonstrates strong results on clinical pathological speech datasets.
Abstract
Speech is a hierarchical collection of text, prosody, emotions, dysfluencies, etc. Automatic transcription of speech that goes beyond text (words) is an underexplored problem. We focus on transcribing speech along with non-fluencies (dysfluencies). The current state-of-the-art pipeline, SSDM, suffers from a complex architecture design, training complexity, and significant shortcomings in the local sequence aligner, and it does not explore in-context learning capacity. In this work, we propose SSDM 2.0, which tackles those shortcomings via four main contributions: (1) We propose a novel neural articulatory flow to derive highly scalable speech representations. (2) We develop a full-stack connectionist subsequence aligner that captures all types of dysfluencies. (3) We introduce a mispronunciation prompt pipeline and consistency learning module into LLM to leverage…
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques
