Extract and Diffuse: Latent Integration for Improved Diffusion-based Speech and Vocal Enhancement

Yudong Yang; Zhan Liu; Wenyi Yu; Guangzhi Sun; Qiuqiang Kong; Chao Zhang

arXiv:2409.09642 · eess.AS · September 23, 2025

TL;DR

This paper introduces Ex-Diff, a score-based diffusion model that integrates latent representations from a discriminative model to enhance speech and vocals. On MUSDB, it reports relative improvements of 3.7% in SI-SDR and 10.0% in SI-SIR over the baseline diffusion model.

Contribution

The paper presents Ex-Diff, a score-based diffusion model that integrates the latent representations produced by a discriminative model, combining the strengths of generative and discriminative approaches to speech and vocal enhancement.

Findings

01. 3.7% relative improvement in SI-SDR over the baseline diffusion model on MUSDB

02. 10.0% relative improvement in SI-SIR over the baseline diffusion model on MUSDB

03. Demonstrates the complementary strengths of generative and discriminative models

Abstract

Diffusion-based generative models have recently achieved remarkable results in speech and vocal enhancement due to their ability to model complex speech data distributions. While these models generalize well to unseen acoustic environments, they may not achieve the same level of fidelity as the discriminative models specifically trained to enhance particular acoustic conditions. In this paper, we propose Ex-Diff, a novel score-based diffusion model that integrates the latent representations produced by a discriminative model to improve speech and vocal enhancement, which combines the strengths of both generative and discriminative models. Experimental results on the widely used MUSDB dataset show relative improvements of 3.7% in SI-SDR and 10.0% in SI-SIR compared to the baseline diffusion model for speech and vocal enhancement tasks, respectively. Additionally, case studies are…
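For readers unfamiliar with the headline metric, the following is a minimal sketch of the standard scale-invariant signal-to-distortion ratio (SI-SDR) and of a relative-improvement calculation. It is not taken from the paper: the function names `si_sdr` and `relative_improvement`, the zero-mean normalization, and the `eps` stabilizer are assumptions made for illustration, and the paper's own evaluation code may differ.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB for two 1-D signals of equal length.

    Illustrative helper (not from the paper). Signals are made zero-mean,
    then the estimate is projected onto the reference so that rescaling
    the estimate does not change the score.
    """
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()

    # Optimal scaling of the reference that best explains the estimate.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target

    return 10.0 * np.log10(
        (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    )


def relative_improvement(baseline, improved):
    """Relative change of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / abs(baseline)


if __name__ == "__main__":
    # Synthetic signals only; these are not results from the paper.
    rng = np.random.default_rng(0)
    t = np.arange(16000) / 16000.0
    clean = np.sin(2 * np.pi * 220.0 * t)
    noisy = clean + 0.1 * rng.standard_normal(clean.shape)
    print(f"SI-SDR of synthetic noisy signal: {si_sdr(noisy, clean):.2f} dB")
```

Under this definition, a "3.7% relative improvement" means the system's score exceeds the baseline's by 3.7% of the baseline value, rather than by 3.7 dB.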

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research

Methods: Diffusion