TL;DR
This paper demonstrates how a variational autoencoder trained on speech data can naturally learn orthogonal subspaces corresponding to source-filter speech components, enabling independent control and analysis of speech features like pitch and formants.
Contribution
The work shows that the source-filter model of speech production emerges as orthogonal subspaces in the latent space of a VAE trained without supervision, so that speech features such as $f_0$ and the formants can be manipulated independently.
Findings
Latent subspaces for $f_0$ and formants are orthogonal.
The proposed method accurately controls speech features using only a few seconds of labeled synthetic speech, with no labeled natural speech.
The work introduces a robust $f_0$ estimation technique based on latent space projections.
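As a rough illustration of the orthogonality claim, the sketch below shows one way such a check could be set up. It is not the paper's procedure, which this summary does not spell out. It assumes that each factor's subspace is estimated by PCA on latent vectors collected while only that factor varies, and that overlap is measured through the cosines of the principal angles between the two subspaces. The helper names `pca_basis` and `subspace_overlap` are hypothetical, and the latent vectors are toy data built to be orthogonal, not results from a trained VAE.

```python
import numpy as np


def pca_basis(latents, n_components):
    """Return an orthonormal basis (as columns) of the top principal directions."""
    centered = latents - latents.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:n_components].T


def subspace_overlap(basis_a, basis_b):
    """Largest cosine of the principal angles between two subspaces.

    A value near 0 means the subspaces are orthogonal; 1 means they share
    at least one direction.
    """
    singular_values = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    return float(singular_values.max())


rng = np.random.default_rng(0)
latent_dim = 16  # assumed latent dimension, for illustration only

# Toy stand-ins for encoder outputs: in a real experiment these would be
# latent vectors of signals in which only f0 (or only one formant) varies.
f0_latents = np.zeros((200, latent_dim))
f0_latents[:, 0:2] = rng.normal(size=(200, 2))
formant_latents = np.zeros((200, latent_dim))
formant_latents[:, 2:4] = rng.normal(size=(200, 2))

f0_basis = pca_basis(f0_latents, n_components=2)
formant_basis = pca_basis(formant_latents, n_components=2)

overlap = subspace_overlap(f0_basis, formant_basis)
self_overlap = subspace_overlap(f0_basis, f0_basis)
print(f"f0 vs formant overlap: {overlap:.3e}")
print(f"f0 vs f0 overlap: {self_overlap:.3f}")
```

With real encoder outputs the overlap would not be exactly zero, so any threshold for calling two subspaces orthogonal is a judgment call that this sketch leaves open.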
Abstract
Understanding and controlling latent representations in deep generative models is a challenging yet important problem for analyzing, transforming and generating various types of data. In speech processing, inspired by the anatomical mechanisms of phonation, the source-filter model considers that speech signals are produced from a few independent and physically meaningful continuous latent factors, among which the fundamental frequency and the formants are of primary importance. In this work, we start from a variational autoencoder (VAE) trained in an unsupervised manner on a large dataset of unlabeled natural speech signals, and we show that the source-filter model of speech production naturally arises as orthogonal subspaces of the VAE latent space. Using only a few seconds of labeled speech signals generated with an artificial speech synthesizer, we propose a method to…
