Synthesizing audio from tongue motion during speech using tagged MRI via transformer
Xiaofeng Liu, Fangxu Xing, Jerry L. Prince, Maureen Stone, Georges El Fakhri, Jonghye Woo

TL;DR
This paper introduces a transformer-based encoder-decoder network that synthesizes speech audio from 4D tongue motion data captured via tagged MRI, advancing understanding of speech motor control and aiding speech disorder treatments.
Contribution
It presents a novel deep learning framework combining 3D convolution, transformer, and GAN techniques to convert tongue motion into intelligible speech audio.
Findings
Successfully generated clear speech waveforms from tongue motion data.
Demonstrated the model's potential to improve understanding of speech motor control.
Enhanced speech synthesis quality with adversarial training.
Abstract
Investigating the relationship between internal tissue point motion of the tongue and oropharyngeal muscle deformation measured from tagged MRI and intelligible speech can aid in advancing speech motor control theories and developing novel treatment methods for speech-related disorders. However, elucidating the relationship between these two sources of information is challenging, due in part to the disparity in data structure between spatiotemporal motion fields (i.e., 4D motion fields) and one-dimensional audio waveforms. In this work, we present an efficient encoder-decoder translation network for exploring the predictive information inherent in 4D motion fields via 2D spectrograms as a surrogate of the audio data. Specifically, our encoder is based on 3D convolutional spatial modeling and transformer-based temporal modeling. The extracted features are processed by an asymmetric 2D…
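The abstract uses 2D spectrograms as a surrogate for the one-dimensional audio waveform. As a rough illustration of that change of representation only, the sketch below turns a toy 1D signal into a 2D time-frequency array with a short-time Fourier transform written in plain Python. It is not the paper's preprocessing: the function names, the frame length of 64, the hop of 32, and the Hann window are all assumptions chosen for the example.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n (an assumed choice of window)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def magnitude_spectrogram(wave, frame_len=64, hop=32):
    """Return a 2D list shaped [num_frames][frame_len // 2 + 1].

    Each row holds the DFT magnitudes of one windowed frame, so the
    1D waveform becomes a 2D (time x frequency) array.
    frame_len and hop are hypothetical values, not taken from the paper.
    """
    window = hann(frame_len)
    bins = frame_len // 2 + 1
    spec = []
    for start in range(0, len(wave) - frame_len + 1, hop):
        frame = [wave[start + i] * window[i] for i in range(frame_len)]
        row = []
        for k in range(bins):
            # Direct DFT of the frame at frequency bin k.
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            row.append(abs(acc))
        spec.append(row)
    return spec


# Toy 1D "audio": a pure tone with 8 cycles per 64-sample frame.
sample_count = 256
tone = [math.sin(2 * math.pi * 8 * t / 64) for t in range(sample_count)]
spec = magnitude_spectrogram(tone)

num_frames = len(spec)
num_bins = len(spec[0])
# Index of the strongest frequency bin in each frame.
peak_bins = [max(range(num_bins), key=row.__getitem__) for row in spec]
```

For this toy tone the energy of every frame concentrates in a single frequency bin, which is the kind of 2D structure an image-style decoder can target more easily than raw samples. A practical pipeline would more likely use an FFT-based library routine and a mel-scaled spectrogram; the direct DFT here is only for self-containment.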
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders
Methods: Convolution
