Acoustic To Articulatory Speech Inversion Using Multi-Resolution Spectro-Temporal Representations Of Speech Signals
Rahil Parikh, Nadee Seneviratne, Ganesh Sivaraman, Shihab Shamma, Carol Espy-Wilson

TL;DR
This study explores the use of multi-resolution spectro-temporal features of speech signals to improve the estimation of articulatory features, demonstrating a correlation of 0.675 with ground-truth data using deep neural networks.
Contribution
It introduces a novel approach utilizing multi-resolution spectro-temporal features for speech inversion, outperforming traditional MFCC-based methods.
Findings
Achieved a correlation of 0.675 with ground-truth articulatory data.
Identified optimal scale and rate parameters for feature extraction.
Demonstrated the effectiveness of multi-resolution features over MFCCs.
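The correlation figure above can be made concrete with a small sketch. It assumes the reported score is a Pearson correlation between estimated and ground-truth trajectories, averaged over channels; the summary does not state the exact protocol. The function name `mean_pearson_correlation` and the toy data are illustrative, not taken from the paper.

```python
import numpy as np


def mean_pearson_correlation(predicted, target):
    """Average Pearson correlation across trajectory channels.

    Both inputs have shape (num_frames, num_channels).
    Returns (mean over channels, per-channel correlations).
    """
    predicted = np.asarray(predicted, dtype=float)
    target = np.asarray(target, dtype=float)
    if predicted.shape != target.shape:
        raise ValueError("predicted and target must have the same shape")
    # Center each channel over time.
    p = predicted - predicted.mean(axis=0)
    t = target - target.mean(axis=0)
    # Per-channel Pearson r: covariance divided by the product of the norms.
    numerator = (p * t).sum(axis=0)
    denominator = np.sqrt((p ** 2).sum(axis=0) * (t ** 2).sum(axis=0))
    per_channel = numerator / denominator
    return per_channel.mean(), per_channel


# Toy data (assumption): six synthetic trajectories with additive noise,
# standing in for real network outputs and measured articulatory data.
rng = np.random.default_rng(0)
target = rng.standard_normal((500, 6))
predicted = target + rng.standard_normal((500, 6))
mean_r, per_channel_r = mean_pearson_correlation(predicted, target)
print(f"mean correlation: {mean_r:.3f}")
```

Reporting the per-channel values alongside the mean helps show whether a single score hides trajectories that are estimated much better or worse than the others.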
Abstract
Multi-resolution spectro-temporal features of a speech signal represent how the brain perceives sounds by tuning cortical cells to different spectral and temporal modulations. These features produce a higher-dimensional representation of the speech signal. The purpose of this paper is to evaluate how well the auditory cortex representation of speech signals contributes to estimating the articulatory features of those signals. Since obtaining articulatory features from acoustic features of speech signals has been a challenging topic of interest for different speech communities, we investigate the possibility of using this multi-resolution representation of speech signals as acoustic features. We used the U. of Wisconsin X-ray Microbeam (XRMB) database of clean speech signals to train a feed-forward deep neural network (DNN) to estimate articulatory trajectories of six tract…
