SSDM 2.0: Time-Accurate Speech Rich Transcription with Non-Fluencies

Jiachen Lian; Xuanru Zhou; Zoe Ezzes; Jet Vonk; Brittany Morin; David Baquirin; Zachary Mille; Maria Luisa Gorno Tempini; Gopala Krishna Anumanchipalli

arXiv:2412.00265 · eess.AS · December 3, 2024

TL;DR

SSDM 2.0 transcribes speech together with its non-fluencies (dysfluencies) using a neural articulatory flow representation, a full-stack connectionist subsequence aligner, and in-context learning with large language models, and it outperforms previous dysfluency transcription models.

Contribution

The paper introduces SSDM 2.0 with a neural articulatory flow, a comprehensive subsequence aligner, in-context learning modules, and a large dysfluency corpus, addressing previous limitations in dysfluency transcription.

Findings

01. Outperforms previous dysfluency transcription models significantly.
02. Effectively captures all types of dysfluencies in speech.
03. Demonstrates strong results on clinical pathological speech datasets.

Abstract

Speech is a hierarchical collection of text, prosody, emotions, dysfluencies, etc. Automatic transcription of speech that goes beyond text (words) is an underexplored problem. We focus on transcribing speech along with non-fluencies (dysfluencies). The current state-of-the-art pipeline SSDM suffers from complex architecture design, training complexity, and significant shortcomings in the local sequence aligner, and it does not explore in-context learning capacity. In this work, we propose SSDM 2.0, which tackles those shortcomings via four main contributions: (1) We propose a novel neural articulatory flow to derive highly scalable speech representations. (2) We developed a full-stack connectionist subsequence aligner that captures all types of dysfluencies. (3) We introduced a mispronunciation prompt pipeline and consistency learning module into LLM to leverage…

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques

Methods: Focus