Speech2rtMRI: Speech-Guided Diffusion Model for Real-time MRI Video of the Vocal Tract during Speech

Hong Nguyen; Sean Foley; Kevin Huang; Xuan Shi; Tiantian Feng; Shrikanth Narayanan

arXiv:2409.15525 · eess.IV · September 25, 2024

TL;DR

This paper presents Speech2rtMRI, a diffusion model that generates real-time MRI videos of the vocal tract from speech input, aiding speech production visualization and related applications.

Contribution

It introduces a novel speech-to-video diffusion approach leveraging pre-trained speech models to generate MRI videos of the vocal tract from speech data.
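
As a rough illustration of what a speech-conditioned diffusion setup involves, the sketch below shows the standard DDPM forward (noising) step on a single flattened frame, plus a mean-pooled speech embedding that could serve as the conditioning vector. This is a generic sketch under stated assumptions, not the authors' implementation: the linear beta schedule, the mean pooling, and all function names (`linear_beta_schedule`, `cumulative_alphas`, `add_noise`, `pool_speech_features`) are hypothetical choices made for illustration, and the toy feature vectors stand in for the output of a pre-trained speech model.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (an assumed, commonly used schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t: running product of (1 - beta_s) for s up to t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def add_noise(frame, alpha_bar_t, rng):
    """Sample x_t from q(x_t | x_0) for one flattened frame.

    Returns the noisy frame and the Gaussian noise that was mixed in;
    a denoiser would typically be trained to predict that noise.
    """
    signal_scale = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    noise = [rng.gauss(0.0, 1.0) for _ in frame]
    noisy = [signal_scale * x + noise_scale * e for x, e in zip(frame, noise)]
    return noisy, noise


def pool_speech_features(features):
    """Mean-pool per-frame speech embeddings into one conditioning vector.

    Mean pooling is an assumption made here for simplicity.
    """
    dim = len(features[0])
    return [sum(f[d] for f in features) / len(features) for d in range(dim)]


rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

frame = [0.5] * 16                      # toy stand-in for one MRI frame
noisy, noise = add_noise(frame, alpha_bars[499], rng)

# Toy stand-in for embeddings from a pre-trained speech model.
cond = pool_speech_features([[1.0, 2.0], [3.0, 4.0]])
```

In a training loop built on this sketch, a denoising network would receive the noisy frames, the timestep, and `cond`, and would be optimized to recover `noise`; how the paper actually injects the speech representation is not specified on this page.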

Findings

01 Pre-trained speech representations significantly improve the quality of the generated video.

02 Phonemes are hard to evaluate in isolation but easier to assess within the context of spoken words.

03 Current limitations include tongue motion artifacts and video distortion.

Abstract

Understanding speech production both visually and kinematically can inform second language learning system designs, as well as the creation of speaking characters in video games and animations. In this work, we introduce a data-driven method to visually represent articulator motion in Magnetic Resonance Imaging (MRI) videos of the human vocal tract during speech based on arbitrary audio or speech input. We leverage large pre-trained speech models, which are embedded with prior knowledge, to generalize the visual domain to unseen data using a speech-to-video diffusion model. Our findings demonstrate that the visual generation significantly benefits from the pre-trained speech representations. We also observed that evaluating phonemes in isolation is challenging but becomes more straightforward when assessed within the context of spoken words. Limitations of the current results include…

Code & Models

Repositories

hong7cong/span-rtmri
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis

Methods: Diffusion