Speaker dependent articulatory-to-acoustic mapping using real-time MRI of the vocal tract
Tamás Gábor Csapó

TL;DR
This paper explores speaker-dependent speech prediction from real-time MRI of the vocal tract using deep neural networks, demonstrating the effectiveness of CNN-LSTM models and highlighting the impact of data synchronization issues.
Contribution
It introduces the novel use of rtMRI for articulatory-to-speech mapping and compares various neural network architectures for this task.
Findings
CNN-LSTM networks outperform other models in speech prediction accuracy.
rtMRI provides detailed articulatory data, including the velum and the pharyngeal region.
Synchronization issues significantly affect prediction quality, as shown in speaker m1's results.
Abstract
Articulatory-to-acoustic (forward) mapping is a technique to predict speech using various articulatory acquisition techniques (e.g. ultrasound tongue imaging, lip video). Real-time MRI (rtMRI) of the vocal tract has not been used before for this purpose. The advantage of MRI is that it has a high 'relative' spatial resolution: it can capture not only lingual, labial and jaw motion, but also the velum and the pharyngeal region, which is typically not possible with other techniques. In the current paper, we train various DNNs (fully connected, convolutional and recurrent neural networks) for articulatory-to-speech conversion, using rtMRI as input, in a speaker-specific way. We use two male and two female speakers of the USC-TIMIT articulatory database, each of them uttering 460 sentences. We evaluate the results with objective (Normalized MSE and MCD) and subjective measures (perceptual…
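The two objective measures named in the abstract can be illustrated with a minimal sketch. It is not the paper's evaluation code: the function names, the choice to skip the 0th (energy) coefficient in MCD, and the normalization of MSE by the variance of the reference are all assumptions made here for illustration, since the summary does not spell out the exact definitions. Frames are assumed to be time-aligned lists of spectral coefficients.

```python
import math


def mel_cepstral_distortion(ref_frames, pred_frames, skip_c0=True):
    """Mean Mel-Cepstral Distortion (dB) over time-aligned frames.

    Uses the common form (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2).
    Skipping the 0th coefficient is an assumption, not taken from the paper.
    """
    if not ref_frames or len(ref_frames) != len(pred_frames):
        raise ValueError("need two non-empty sequences of equal length")
    start = 1 if skip_c0 else 0
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, pred in zip(ref_frames, pred_frames):
        sq_dist = sum((r - p) ** 2 for r, p in zip(ref[start:], pred[start:]))
        total += scale * math.sqrt(2.0 * sq_dist)
    return total / len(ref_frames)


def normalized_mse(ref_frames, pred_frames):
    """MSE divided by the variance of the reference values.

    This is one common normalization (assumed here); a value of 1.0 means
    the prediction is no better than always outputting the reference mean.
    """
    ref = [v for frame in ref_frames for v in frame]
    pred = [v for frame in pred_frames for v in frame]
    if not ref or len(ref) != len(pred):
        raise ValueError("need two non-empty sequences of equal size")
    mean_ref = sum(ref) / len(ref)
    mse = sum((r - p) ** 2 for r, p in zip(ref, pred)) / len(ref)
    var = sum((r - mean_ref) ** 2 for r in ref) / len(ref)
    if var == 0.0:
        raise ValueError("reference has zero variance")
    return mse / var
```

For both measures lower is better, and identical reference and predicted frames give 0.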
Taxonomy
Topics: Speech and Audio Processing · Phonetics and Phonology Research · Speech Recognition and Synthesis
