Real-Time MRI Video synthesis from time aligned phonemes with sequence-to-sequence networks
Sathvik Udupa, Prasanta Kumar Ghosh

TL;DR
This paper introduces a sequence-to-sequence transformer-based model with CVAE features to generate realistic real-time MRI videos from phoneme sequences, aiding speech production research.
Contribution
It presents a novel model combining transformers and CVAE for subject-specific rtMRI video synthesis from phonemes, improving realism and generalization.
Findings
The model generates realistic rtMRI videos for unseen utterances.
Adding CVAE features improves learning for difficult subject-specific mappings.
Subject-specific training enhances synthesis accuracy.
Abstract
Real-time magnetic resonance imaging (rtMRI) of the midsagittal plane of the mouth is of interest for speech production research. In this work, we focus on estimating utterance-level rtMRI video from the spoken phoneme sequence. We obtain time-aligned phonemes from forced alignment and convert them to frame-level phoneme sequences that are aligned with the rtMRI frames. We propose a sequence-to-sequence learning model with a transformer phoneme encoder and a convolutional frame decoder. We then modify the learning by using intermediary features obtained by sampling from a pretrained phoneme-conditioned variational autoencoder (CVAE). We train on 8 subjects in a subject-specific manner and demonstrate the performance with a subjective test. We also use an auxiliary task of air-tissue boundary (ATB) segmentation to obtain objective scores for the proposed models. We show that the proposed method…
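The abstract's step of turning time-aligned phonemes into frame-level phoneme sequences can be illustrated with a minimal sketch. This is not the authors' code: the function name `phonemes_to_frames`, the `(start, end, phoneme)` interval layout, the frame-center sampling rule, the `"sil"` padding label, and the frame rate of 10 fps used in the example are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical interval format: (start_sec, end_sec, phoneme), as a forced
# aligner might produce. The paper does not specify its format.
Interval = Tuple[float, float, str]


def phonemes_to_frames(
    intervals: List[Interval],
    fps: float,
    n_frames: int,
    pad: str = "sil",
) -> List[str]:
    """Assign one phoneme label to each video frame.

    Each frame is labeled with the phoneme whose interval contains the
    frame's center time. Frames not covered by any interval get `pad`
    (an assumed convention, not taken from the paper).
    """
    labels = [pad] * n_frames
    for i in range(n_frames):
        center = (i + 0.5) / fps  # center time of frame i in seconds
        for start, end, phoneme in intervals:
            if start <= center < end:
                labels[i] = phoneme
                break
    return labels


# Toy alignment with made-up timings, sampled at an assumed 10 fps.
example = [(0.0, 0.2, "sil"), (0.2, 0.5, "AH"), (0.5, 0.6, "T")]
frame_labels = phonemes_to_frames(example, fps=10, n_frames=7)
print(frame_labels)
# ['sil', 'sil', 'AH', 'AH', 'AH', 'T', 'sil']
```

The resulting list has one label per rtMRI frame, so it can be fed to a phoneme encoder whose output length matches the number of video frames. Other labeling rules, such as majority overlap within a frame, would be equally plausible.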
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Conditional Variational Autoencoder
