Decoding Vocal Articulations from Acoustic Latent Representations

Mateo Cámara; Fernando Marcos; José Luis Blanco

arXiv:2406.14379 · eess.AS · June 21, 2024 · 1 citation

Open Access

TL;DR

This paper introduces a neural encoder system that decodes articulatory features from acoustic representations using pretrained models and a voice synthesizer, enabling faster and more efficient acoustic-to-articulatory inversion.

Contribution

It presents a novel approach combining pretrained models and a neural encoder to improve acoustic-to-articulatory inversion efficiency and accuracy.

Findings

01. Predicted parameters produce human-like vowel sounds.
02. The system effectively captures articulatory features from acoustic data.
03. Using pretrained models reduces computational overhead.

Abstract

We present a novel neural encoder system for acoustic-to-articulatory inversion. We leverage the Pink Trombone voice synthesizer, which exposes articulatory parameters (e.g., tongue position and vocal cord configuration). Our system is designed to identify the articulatory features responsible for producing specific acoustic characteristics contained in a neural latent representation. To generate the necessary latent embeddings, we employed two main methodologies. The first was a self-supervised variational autoencoder trained from scratch to reconstruct the input signal at the decoder stage. We conditioned its bottleneck layer with a subnetwork called the "projector," which decodes the voice synthesizer's parameters. The second methodology utilized two pretrained models: EnCodec and Wav2Vec. They eliminate the need to train the encoding process from scratch, allowing us to focus on…
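
As a rough illustration of the "projector" idea described in the abstract, the sketch below maps a latent vector to a small set of bounded synthesizer parameters with a single linear layer followed by a sigmoid. This is not the authors' implementation: the latent size, the single-layer design, the parameter names (`tongue_position`, `vocal_cord_tension`), and their ranges are all assumptions chosen only to make the example runnable.

```python
import math
import random

# ASSUMPTION: these names and ranges are illustrative placeholders,
# not the actual Pink Trombone controls or bounds used in the paper.
PARAM_RANGES = {
    "tongue_position": (0.0, 1.0),
    "vocal_cord_tension": (0.0, 1.0),
}

LATENT_DIM = 8  # ASSUMPTION: arbitrary latent size for this sketch


def init_projector(latent_dim, n_params, seed=0):
    """Return small random weights and zero biases for one linear layer."""
    rng = random.Random(seed)
    weights = [
        [rng.gauss(0.0, 0.1) for _ in range(latent_dim)]
        for _ in range(n_params)
    ]
    biases = [0.0] * n_params
    return weights, biases


def sigmoid(x):
    """Squash a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def project(latent, weights, biases, ranges=PARAM_RANGES):
    """Map a latent vector to named parameters, each inside its own range."""
    params = {}
    for (name, (low, high)), row, bias in zip(ranges.items(), weights, biases):
        activation = sum(w * z for w, z in zip(row, latent)) + bias
        params[name] = low + (high - low) * sigmoid(activation)
    return params


weights, biases = init_projector(LATENT_DIM, len(PARAM_RANGES))
latent = [0.5] * LATENT_DIM  # stand-in for an encoder output
predicted = project(latent, weights, biases)
```

The sigmoid rescaling is one simple way to keep every predicted value inside a range a synthesizer could accept. In a setup like the one the abstract describes, the latent vector would come from the encoder (the variational autoencoder bottleneck or a pretrained model) and the projector weights would be learned rather than random.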

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems