MDDM: A Multi-view Discriminative Enhanced Diffusion-based Model for Speech Enhancement

Nan Xu; Zhaolong Huang; Xiaonan Zhi

arXiv:2505.13029·eess.AS·October 31, 2025·Interspeech


TL;DR

The paper introduces MDDM, a diffusion-based speech enhancement model. A discriminative network receives features from the time, frequency, and noise domains and produces a preliminary estimate, which a few diffusion sampling steps then refine. This improves speech quality at reduced computational cost.

Contribution

MDDM is the first to integrate multi-view features with a diffusion-based approach for speech enhancement, reducing the number of sampling steps while maintaining performance.

Findings

1. MDDM outperforms existing methods on public and real-world datasets.
2. It achieves competitive enhancement quality with fewer sampling steps than other diffusion-based methods.
3. Subjective and objective metrics confirm its effectiveness.

Abstract

With the development of deep learning, speech enhancement has improved greatly in terms of speech quality. Previous methods typically rely on either discriminative supervised learning or generative modeling, which tend to introduce speech distortions or incur high computational cost. In this paper, we propose MDDM, a Multi-view Discriminative enhanced Diffusion-based Model. Specifically, we feed features from three domains (time, frequency, and noise) into a discriminative prediction network, which generates a preliminary spectrogram. The discriminative output is then converted to clean speech through a few inference sampling steps. Because the distribution of the discriminative output overlaps with that of the clean target, fewer sampling steps suffice to achieve performance competitive with other diffusion-based methods. Experiments conducted on a public dataset and a…
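The abstract describes two stages. First, a discriminative network receives time, frequency, and noise views of the input and outputs a preliminary spectrogram. Second, a few reverse diffusion steps start from that output rather than from pure noise. The NumPy sketch below illustrates only this control flow, and every component is an assumption of ours rather than MDDM's design. The three view definitions in `multi_view_features` are hypothetical. So is the Wiener-like `discriminative_stand_in`, which replaces the trained network. The sampler `warm_start_sample` applies Euler steps to a variance-exploding probability-flow ODE, a swapped-in technique that is not necessarily the paper's sampler. Finally, the score function is an oracle Gaussian centered on the clean spectrogram, used here in place of a trained score network.

```python
import numpy as np


def frames_of(x, n_fft=256, hop=128):
    """Split a waveform into overlapping frames; leftover tail samples are dropped."""
    n_frames = 1 + (len(x) - n_fft) // hop
    return np.stack([x[i * hop:i * hop + n_fft] for i in range(n_frames)])  # (T, n_fft)


def multi_view_features(noisy, n_fft=256, hop=128):
    """Stack three HYPOTHETICAL views of the input into a (3, F, T) array."""
    frames = frames_of(noisy, n_fft, hop)
    # Frequency view: Hann-windowed STFT magnitude, shape (F, T).
    mag = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)).T
    # Time view (assumption): per-frame RMS of the raw waveform, broadcast over F.
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    time_view = np.broadcast_to(rms[None, :], mag.shape)
    # Noise view (assumption): per-bin minimum over time as a crude noise floor.
    floor = np.min(mag, axis=1, keepdims=True)
    noise_view = np.broadcast_to(floor, mag.shape)
    return np.stack([mag, time_view, noise_view])


def discriminative_stand_in(features, gain_floor=0.1, eps=1e-8):
    """Hypothetical placeholder for the trained discriminative network (Wiener-like gain)."""
    mag, _, noise = features
    gain = np.maximum(1.0 - noise ** 2 / (mag ** 2 + eps), gain_floor)
    return gain * mag  # preliminary spectrogram estimate


def warm_start_sample(prelim, score_fn, sigma_start=0.5, n_steps=5, seed=0):
    """Euler steps of the VE probability-flow ODE dx/dsigma = -sigma * score(x, sigma),
    integrated from sigma_start down to 0, starting at prelim plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    sigmas = np.linspace(sigma_start, 0.0, n_steps + 1)
    x = prelim + sigma_start * rng.standard_normal(prelim.shape)
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        x = x + (s_next - s_cur) * (-s_cur * score_fn(x, s_cur))
    return x


# Toy demo: a 440 Hz tone present only in the first half second, plus white noise.
sr = 8000
t = np.arange(sr) / sr
clean = np.sin(2 * np.pi * 440 * t) * (t < 0.5)
noisy = clean + 0.3 * np.random.default_rng(1).standard_normal(sr)

feats = multi_view_features(noisy)
prelim = discriminative_stand_in(feats)
clean_mag = multi_view_features(clean)[0]

S0 = 0.05  # spread of the toy clean-data Gaussian (assumption)


def oracle_score(x, sigma):
    """ORACLE score of N(clean_mag, S0^2) noised to level sigma; not a trained model."""
    return -(x - clean_mag) / (S0 ** 2 + sigma ** 2)


enhanced = warm_start_sample(prelim, oracle_score)
print(feats.shape, enhanced.shape)  # → (3, 129, 61) (129, 61)
```

Because the sampler starts at the preliminary estimate with a small `sigma_start`, it only traverses the low-noise end of the schedule, so a small `n_steps` can suffice. This loosely mirrors the abstract's argument about overlapping distributions. The actual noise schedule, step count, and network architecture are those reported in the paper, not the values assumed here.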

