Learning robust speech representation with an articulatory-regularized variational autoencoder
Marc-Antoine Georges, Laurent Girin, Jean-Luc Schwartz, Thomas Hueber

TL;DR
This paper introduces an articulatory-regularized variational autoencoder that leverages articulatory parameters to improve speech representation learning, resulting in faster training, lower reconstruction loss, and enhanced speech denoising performance.
Contribution
It develops an articulatory model and integrates it into a VAE, demonstrating improved training efficiency and speech denoising compared to standard models.
Findings
Reduced time to convergence and lower reconstruction loss at convergence
Enhanced speech denoising performance
Effective incorporation of articulatory features
Abstract
It is increasingly considered that human speech perception and production both rely on articulatory representations. In this paper, we investigate whether this type of representation could improve the performance of a deep generative model (here a variational autoencoder) trained to encode and decode acoustic speech features. First, we develop an articulatory model able to associate articulatory parameters describing the jaw, tongue, lips and velum configurations with vocal tract shapes and spectral features. Then we incorporate these articulatory parameters into a variational autoencoder applied to spectral features by using a regularization technique that constrains part of the latent space to follow articulatory trajectories. We show that this articulatory constraint improves model training by decreasing time to convergence and reconstruction loss at convergence, and yields better…
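The regularization idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of the penalty, so the code below makes assumptions: a squared-error term ties the first latent dimensions to the articulatory parameters, and the names `regularized_vae_loss`, `beta`, and `lam` are hypothetical and not taken from the paper. It shows only how such a term could be added to a standard VAE objective for a single frame.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL divergence between N(mu, diag(exp(log_var))) and N(0, I), summed over dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))

def articulatory_penalty(mu, artic):
    """Assumed penalty: squared error between the first len(artic) latent means
    and the articulatory parameters for the same frame."""
    return sum((m - a) ** 2 for m, a in zip(mu[:len(artic)], artic))

def regularized_vae_loss(x, x_hat, mu, log_var, artic, beta=1.0, lam=1.0):
    """Reconstruction error + beta * KL + lam * articulatory penalty (hypothetical weighting)."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))  # squared error on spectral features
    kl = kl_to_standard_normal(mu, log_var)
    reg = articulatory_penalty(mu, artic)
    return recon + beta * kl + lam * reg

# One frame: 3 spectral features, 2 latent dimensions, 1 articulatory parameter.
loss = regularized_vae_loss(
    x=[0.1, 0.2, 0.3], x_hat=[0.1, 0.2, 0.3],
    mu=[0.0, 0.0], log_var=[0.0, 0.0],
    artic=[0.5],
)
```

In this sketch only the first latent dimension is constrained; the remaining dimension is left free, which mirrors the abstract's statement that the constraint applies to part of the latent space.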
