Improving Speech Inversion Through Self-Supervised Embeddings and Enhanced Tract Variables
Ahmed Adel Attia, Yashish M. Siriwardena, Carol Espy-Wilson

TL;DR
This paper enhances speech inversion accuracy by integrating self-supervised speech representations and improved geometric transformation of tract variables, resulting in significant performance gains.
Contribution
It introduces the combined use of SSL-derived features and advanced geometric transformations for better speech inversion performance.
Findings
PPMC scores increased from 0.7452 to 0.8141
SSL models significantly improve feature representation
Enhanced geometric transformations boost TV estimation accuracy
Abstract
The performance of deep learning models depends significantly on their capacity to encode input features efficiently and decode them into meaningful outputs. Better input and output representation has the potential to boost models' performance and generalization. In the context of acoustic-to-articulatory speech inversion (SI) systems, we study the impact of utilizing speech representations acquired via self-supervised learning (SSL) models, such as HuBERT compared to conventional acoustic features. Additionally, we investigate the incorporation of novel tract variables (TVs) through an improved geometric transformation model. By combining these two approaches, we improve the Pearson product-moment correlation (PPMC) scores which evaluate the accuracy of TV estimation of the SI system from 0.7452 to 0.8141, a 6.9% increase. Our findings underscore the profound influence of rich feature…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
