Investigation of Speech and Noise Latent Representations in Single-channel VAE-based Speech Enhancement
Jiatong Li, Simon Doclo

TL;DR
This paper explores how different latent representations in VAE-based speech enhancement systems impact performance, demonstrating that well-separated speech and noise representations significantly improve enhancement quality.
Contribution
The paper investigates how different pretrained speech and noise latent representations affect VAE-based speech enhancement, and shows that performance depends on how clearly the two representations are separated in the latent space.
Findings
Separated latent representations improve speech enhancement performance
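Modifying the pretrained VAE loss terms changes the resulting speech and noise latent representations
Experiments on DNS3, WSJ0-QUT, and VoiceBank-DEMAND show significant gains over standard VAEs, whose speech and noise representations overlap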
Abstract
Recently, a variational autoencoder (VAE)-based single-channel speech enhancement system using Bayesian permutation training has been proposed, which uses two pretrained VAEs to obtain latent representations for speech and noise. Based on these pretrained VAEs, a noisy VAE learns to generate speech and noise latent representations from noisy speech for speech enhancement. Modifying the pretrained VAE loss terms affects the pretrained speech and noise latent representations. In this paper, we investigate how these different representations affect speech enhancement performance. Experiments on the DNS3, WSJ0-QUT, and VoiceBank-DEMAND datasets show that a latent space where speech and noise representations are clearly separated significantly improves performance over standard VAEs, which produce overlapping speech and noise representations.
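The abstract does not state which loss terms are modified or how. As a minimal sketch of what modifying a VAE loss term can look like, assuming a standard VAE objective (a reconstruction term plus a KL divergence to a standard normal prior), the Python below exposes a hypothetical `kl_weight` parameter. The function names and the weighting scheme are illustrative assumptions, not the authors' method.

```python
import math


def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) for a single latent vector."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar)
    )


def vae_loss(recon_error, mu, logvar, kl_weight=1.0):
    """Reconstruction error plus a weighted KL term.

    kl_weight is a hypothetical knob (assumption): 1.0 gives the standard
    VAE objective, other values change how strongly the latent codes are
    pulled toward the prior.
    """
    return recon_error + kl_weight * gaussian_kl(mu, logvar)


# A latent code that matches the prior exactly incurs no KL penalty.
kl_at_prior = gaussian_kl([0.0, 0.0], [0.0, 0.0])

# Halving the KL weight halves the penalty for a code with mean 1, unit variance.
loss_half_kl = vae_loss(2.0, [1.0], [0.0], kl_weight=0.5)
```

Changing the weight on a term like this alters where the encoder places its latent codes, which is the kind of effect on the pretrained representations that the abstract describes.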
