Acoustic-to-articulatory Inversion of the Complete Vocal Tract from RT-MRI with Various Audio Embeddings and Dataset Sizes
Sofiane Azzouz, Pierre-André Vuissoz, Yves Laprie

TL;DR
This study demonstrates the feasibility of complete vocal tract inversion from RT-MRI data using a Bi-LSTM model, tested with various audio embeddings and dataset sizes, and reaches an average RMSE of 1.48 mm in articulatory reconstruction.
Contribution
It introduces a novel approach using articulator contours from RT-MRI for full vocal tract inversion, moving beyond traditional EMA-based methods.
Findings
Average RMSE of 1.48 mm, slightly below the pixel size of 1.62 mm.
The impact of different audio embeddings is evaluated, with HuBERT showing promising results.
Dataset size, varied from 10 minutes to 3.5 hours, influences inversion accuracy.
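To make the RMSE figure concrete, the sketch below computes a root-mean-square error between a predicted and a reference contour and converts it from pixels to millimetres. It is only an illustration under stated assumptions, not the authors' evaluation code: the function name `contour_rmse_mm`, the point-to-point Euclidean distance, and the representation of a contour as matched (x, y) points in pixel coordinates are all hypothetical. Only the 1.62 mm pixel size comes from the findings.

```python
import math

# Pixel size reported in the findings; everything else here is an assumption.
PIXEL_SIZE_MM = 1.62


def contour_rmse_mm(predicted, reference, pixel_size_mm=PIXEL_SIZE_MM):
    """Return the RMSE between two contours, in millimetres.

    Each contour is a list of (x, y) points in pixel coordinates.
    The two lists are assumed to have the same length, with points
    in matching order (a simplification made for this sketch).
    """
    if not predicted or len(predicted) != len(reference):
        raise ValueError("contours must be non-empty and of equal length")

    # Squared Euclidean distance between each pair of matched points.
    squared_distances = [
        (px - rx) ** 2 + (py - ry) ** 2
        for (px, py), (rx, ry) in zip(predicted, reference)
    ]

    # Root of the mean squared distance, scaled from pixels to mm.
    rmse_pixels = math.sqrt(sum(squared_distances) / len(squared_distances))
    return rmse_pixels * pixel_size_mm


# A predicted contour shifted by exactly one pixel from the reference.
reference_contour = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
predicted_contour = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
one_pixel_error = contour_rmse_mm(predicted_contour, reference_contour)
```

Under this definition, a uniform one-pixel offset yields an error equal to the pixel size, so an average of 1.48 mm corresponds to a little less than one pixel of displacement.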
Abstract
Acoustic-to-articulatory inversion strongly depends on the type of data used. While most previous studies rely on EMA, which is limited by the number of sensors and restricted to accessible articulators, we propose an approach aiming at a complete inversion of the vocal tract, from the glottis to the lips. To this end, we used approximately 3.5 hours of RT-MRI data from a single speaker. The innovation of our approach lies in the use of articulator contours automatically extracted from MRI images, rather than relying on the raw images themselves. By focusing on these contours, the model prioritizes the essential geometric dynamics of the vocal tract while discarding redundant pixel-level information. These contours, alongside denoised audio, were then processed using a Bi-LSTM architecture. Two experiments were conducted: (1) the analysis of the impact of the audio embedding, for which…
