Conditional Latent Diffusion-Based Speech Enhancement Via Dual Context Learning
Shengkui Zhao, Zexu Pan, Kun Zhou, Yukun Ma, Chong Zhang, Bin Ma

TL;DR
This paper introduces a conditional latent diffusion model with dual-context learning for speech enhancement. Operating in a low-dimensional latent space reduces generation complexity, and the method generalizes better to unseen noise environments.
Contribution
It proposes a novel combination of a variational autoencoder and a conditional latent diffusion model with dual-context learning for more efficient and robust speech enhancement.
Findings
Outperforms existing diffusion-based methods in speech enhancement tasks.
Requires fewer iterative steps for effective denoising.
Shows superior generalization to out-of-domain noise datasets.
Abstract
Recently, the application of diffusion probabilistic models has advanced speech enhancement through generative approaches. However, existing diffusion-based methods have focused on the generation process in high-dimensional waveform or spectral domains, leading to increased generation complexity and slower inference speeds. Additionally, these methods have primarily modelled clean speech distributions, with limited exploration of noise distributions, thereby constraining the discriminative capability of diffusion models for speech enhancement. To address these issues, we propose a novel approach that integrates a conditional latent diffusion model (cLDM) with dual-context learning (DCL). Our method utilizes a variational autoencoder (VAE) to compress mel-spectrograms into a low-dimensional latent space. We then apply cLDM to transform the latent representations of both clean speech and…
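The latent-space idea in the abstract can be illustrated with the standard DDPM forward (noising) step applied to a latent vector instead of a waveform or spectrogram. The following is a minimal sketch under stated assumptions, not the authors' implementation: the linear beta schedule, the 50-step count, the 16-dimensional latents, the random stand-ins for VAE outputs, and conditioning by concatenation are all illustrative choices that the abstract does not specify, and every function name is hypothetical.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (a common DDPM default; assumed here)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Return alpha_bar_t, the running product of (1 - beta_s) for s <= t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def diffuse_latent(z0, t, alpha_bars, rng):
    """Sample z_t from q(z_t | z_0) = N(sqrt(alpha_bar_t) * z_0, (1 - alpha_bar_t) * I)."""
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    z_t = [signal_scale * z + noise_scale * e for z, e in zip(z0, eps)]
    return z_t, eps


rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(50))  # 50 steps is arbitrary

# Random stand-ins for VAE-encoded mel-spectrograms (hypothetical 16-dim latents).
clean_latent = [rng.gauss(0.0, 1.0) for _ in range(16)]
noisy_latent = [rng.gauss(0.0, 1.0) for _ in range(16)]

# Noise the clean-speech latent to an intermediate diffusion step.
z_t, eps = diffuse_latent(clean_latent, t=25, alpha_bars=alpha_bars, rng=rng)

# Assumed conditioning scheme: concatenate the noisy-speech latent with z_t
# to form the input a denoising network would receive.
denoiser_input = z_t + noisy_latent
```

Because each latent is far smaller than the mel-spectrogram it encodes, every step of the iterative process touches fewer values, which is the source of the complexity reduction the abstract describes. A denoising network trained to predict `eps` from `denoiser_input` would complete the picture, but its architecture is outside what the abstract states.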
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
