Decoding Vocal Articulations from Acoustic Latent Representations
Mateo Cámara, Fernando Marcos, José Luis Blanco

TL;DR
This paper introduces a neural encoder system that decodes articulatory features from acoustic representations using pretrained models and a voice synthesizer, enabling faster and more efficient acoustic-to-articulatory inversion.
Contribution
It presents a novel approach combining pretrained models and a neural encoder to improve acoustic-to-articulatory inversion efficiency and accuracy.
Findings
Predicted parameters produce human-like vowel sounds.
The system effectively captures articulatory features from acoustic data.
Using pretrained models reduces computational overhead.
Abstract
We present a novel neural encoder system for acoustic-to-articulatory inversion. We leverage the Pink Trombone voice synthesizer, which exposes articulatory parameters (e.g., tongue position and vocal cord configuration). Our system is designed to identify the articulatory features responsible for producing specific acoustic characteristics contained in a neural latent representation. To generate the necessary latent embeddings, we employed two main methodologies. The first was a self-supervised variational autoencoder trained from scratch to reconstruct the input signal at the decoder stage. We conditioned its bottleneck layer with a subnetwork called the "projector," which decodes the voice synthesizer's parameters. The second methodology utilized two pretrained models: EnCodec and Wav2Vec. They eliminate the need to train the encoding process from scratch, allowing us to focus on…
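The "projector" idea in the abstract can be illustrated with a minimal sketch: a small network maps a latent embedding to bounded values, which are then rescaled to synthesizer parameter ranges. Everything concrete here is an assumption made for illustration and is not taken from the paper: the parameter names and ranges in `PARAM_RANGES`, the two-layer shape of `Projector`, the layer sizes, and the random untrained weights. The latent vector stands in for an embedding that would come from the autoencoder bottleneck or from a pretrained model such as EnCodec or Wav2Vec.

```python
import numpy as np

# Hypothetical articulatory parameters and ranges, chosen for illustration
# only; they are not the Pink Trombone's actual interface.
PARAM_RANGES = {
    "tongue_position": (0.0, 1.0),
    "tongue_height": (0.0, 1.0),
    "vocal_tenseness": (0.0, 1.0),
}


class Projector:
    """Toy two-layer network mapping a latent vector to values in (0, 1).

    The weights are random and untrained; a real projector would be fitted
    against known synthesizer parameters.
    """

    def __init__(self, latent_dim, hidden_dim, n_params, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.1, size=(latent_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(0.0, 0.1, size=(hidden_dim, n_params))
        self.b2 = np.zeros(n_params)

    def __call__(self, z):
        hidden = np.tanh(z @ self.w1 + self.b1)
        logits = hidden @ self.w2 + self.b2
        # Sigmoid keeps every output strictly between 0 and 1.
        return 1.0 / (1.0 + np.exp(-logits))


def to_synth_params(unit_values, ranges=PARAM_RANGES):
    """Rescale values in (0, 1) to each parameter's (low, high) range."""
    return {
        name: low + (high - low) * float(u)
        for (name, (low, high)), u in zip(ranges.items(), unit_values)
    }


# Placeholder latent vector standing in for an encoder embedding.
latent = np.zeros(16)
projector = Projector(latent_dim=16, hidden_dim=8, n_params=len(PARAM_RANGES))
unit_values = projector(latent)
synth_params = to_synth_params(unit_values)
```

Bounding the outputs with a sigmoid before rescaling is one simple way to guarantee that every predicted value is valid for the synthesizer; other output parameterizations would serve equally well.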
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems
