Improving Speech Inversion Through Self-Supervised Embeddings and Enhanced Tract Variables

Ahmed Adel Attia; Yashish M. Siriwardena; Carol Espy-Wilson

arXiv:2309.09220 · eess.AS · September 10, 2024

Open Access

TL;DR

This paper improves acoustic-to-articulatory speech inversion by combining self-supervised speech representations such as HuBERT with novel tract variables derived from an improved geometric transformation model, raising PPMC scores from 0.7452 to 0.8141.

Contribution

It combines SSL-derived speech representations with novel tract variables obtained through an improved geometric transformation model, and evaluates the effect of both on speech inversion performance.

Findings

01. PPMC scores increased from 0.7452 to 0.8141

02. SSL models significantly improve feature representation

03. Enhanced geometric transformations boost TV estimation accuracy

Abstract

The performance of deep learning models depends significantly on their capacity to encode input features efficiently and decode them into meaningful outputs. Better input and output representation has the potential to boost models' performance and generalization. In the context of acoustic-to-articulatory speech inversion (SI) systems, we study the impact of utilizing speech representations acquired via self-supervised learning (SSL) models, such as HuBERT, compared to conventional acoustic features. Additionally, we investigate the incorporation of novel tract variables (TVs) through an improved geometric transformation model. By combining these two approaches, we improve the Pearson product-moment correlation (PPMC) scores, which evaluate the accuracy of TV estimation of the SI system, from 0.7452 to 0.8141, an absolute increase of 0.0689 (reported as 6.9%). Our findings underscore the profound influence of rich feature…
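As a rough illustration of the metric in the abstract, the sketch below computes a Pearson product-moment correlation between a reference trajectory and an estimated one, then works through the reported score change. The function names and the per-TV averaging in `mean_ppmc` are assumptions made for this example; the abstract does not say how scores are aggregated across tract variables. The arithmetic shows that 0.7452 to 0.8141 is an absolute gain of 0.0689, which corresponds to a relative gain of roughly 9.2%.

```python
import math


def ppmc(reference, estimate):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(reference) != len(estimate) or len(reference) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(reference)
    mean_ref = sum(reference) / n
    mean_est = sum(estimate) / n
    cov = sum((r - mean_ref) * (e - mean_est) for r, e in zip(reference, estimate))
    var_ref = sum((r - mean_ref) ** 2 for r in reference)
    var_est = sum((e - mean_est) ** 2 for e in estimate)
    if var_ref == 0 or var_est == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_ref * var_est)


def mean_ppmc(reference_tvs, estimated_tvs):
    """Average PPMC over tract variables (assumed aggregation, not from the paper).

    Both arguments map a TV name to its trajectory (a sequence of floats).
    """
    scores = [ppmc(reference_tvs[name], estimated_tvs[name]) for name in reference_tvs]
    return sum(scores) / len(scores)


# Scores quoted in the abstract.
baseline_ppmc = 0.7452
improved_ppmc = 0.8141

absolute_gain = improved_ppmc - baseline_ppmc  # difference in correlation
relative_gain = absolute_gain / baseline_ppmc  # gain as a fraction of the baseline

print(f"absolute gain: {absolute_gain:.4f}")
print(f"relative gain: {relative_gain:.2%}")
```

The hand-written `ppmc` keeps the example dependency-free; in practice a library routine such as `numpy.corrcoef` would give the same value.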

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing