Diffusion-based Unsupervised Audio-visual Speech Enhancement

Jean-Eudes Ayilo (MULTISPEECH); Mostafa Sadeghi (MULTISPEECH); Romain Serizel (MULTISPEECH); Xavier Alameda-Pineda (ROBOTLEARN)

arXiv:2410.05301 · cs.SD · January 16, 2025

TL;DR

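This paper introduces an unsupervised audio-visual speech enhancement (AVSE) method that combines a video-conditioned diffusion generative model of speech with NMF noise modeling. It outperforms its audio-only counterpart, generalizes better than a recent supervised-generative AVSE method, and comes with a faster inference algorithm.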

Contribution

It presents a novel diffusion-based unsupervised AVSE approach that integrates video-conditioned speech generation with NMF noise modeling, enhancing performance and generalization.
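
One common way to write such a model in the short-time Fourier transform domain is sketched below. This is an assumption for illustration, not the paper's stated formulation: the noisy coefficient x_{ft} is the sum of clean speech s_{ft} and noise n_{ft}, and the noise is zero-mean complex Gaussian with a variance factorized by NMF. The symbols W, H, F, T, and the rank K are illustrative names that do not appear in the summary above.

```latex
x_{ft} = s_{ft} + n_{ft}, \qquad
n_{ft} \sim \mathcal{N}_c\!\left(0,\ (\mathbf{W}\mathbf{H})_{ft}\right), \qquad
\mathbf{W} \in \mathbb{R}_{+}^{F \times K},\quad
\mathbf{H} \in \mathbb{R}_{+}^{K \times T}
```

Under this assumption, the speech prior comes from the video-conditioned diffusion model, while W and H are the only noise parameters estimated at test time.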

Findings

01. Outperforms its audio-only counterpart.

02. Generalizes better than a recent supervised-generative AVSE method.

03. Offers a faster inference algorithm with comparable or improved performance.

Abstract

This paper proposes a new unsupervised audio-visual speech enhancement (AVSE) approach that combines a diffusion-based audio-visual speech generative model with a non-negative matrix factorization (NMF) noise model. First, the diffusion model is pre-trained on clean speech conditioned on corresponding video data to simulate the speech generative distribution. This pre-trained model is then paired with the NMF-based noise model to estimate clean speech iteratively. Specifically, a diffusion-based posterior sampling approach is implemented within the reverse diffusion process, where after each iteration, a speech estimate is obtained and used to update the noise parameters. Experimental results confirm that the proposed AVSE approach not only outperforms its audio-only counterpart but also generalizes better than a recent supervised-generative AVSE method. Additionally, the new inference…
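
The alternation described in the abstract (obtain a speech estimate, then update the noise parameters) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the NMF update uses the standard multiplicative rules for the Itakura-Saito divergence, which the abstract does not specify, and `wiener_placeholder` is a hypothetical stand-in for one reverse-diffusion posterior-sampling step, which in the real method would call the pre-trained video-conditioned diffusion model. All function and parameter names are invented for this example.

```python
import numpy as np


def update_nmf_noise(residual_power, W, H, n_iter=1, eps=1e-10):
    """Fit residual_power ~= W @ H with Itakura-Saito multiplicative updates.

    Assumption: IS-NMF updates are a common choice for variance modeling;
    the paper's exact update rule is not given in the abstract.
    """
    for _ in range(n_iter):
        V = W @ H + eps
        W = W * (((residual_power / V**2) @ H.T) / ((1.0 / V) @ H.T + eps))
        V = W @ H + eps
        H = H * ((W.T @ (residual_power / V**2)) / (W.T @ (1.0 / V) + eps))
    return W, H


def wiener_placeholder(noisy_stft, noise_var, step):
    """Hypothetical stand-in for one posterior-sampling step.

    Uses a Wiener gain with a fixed unit speech variance. The real method
    would instead run a reverse-diffusion step conditioned on video.
    """
    speech_var = 1.0
    gain = speech_var / (speech_var + noise_var)
    return gain * noisy_stft


def enhance(noisy_stft, speech_estimator, n_steps=30, rank=4, seed=0):
    """Alternate between estimating speech and refitting the NMF noise model."""
    rng = np.random.default_rng(seed)
    n_freq, n_frames = noisy_stft.shape
    # Random positive initialization of the noise dictionary and activations.
    W = rng.uniform(0.1, 1.0, (n_freq, rank))
    H = rng.uniform(0.1, 1.0, (rank, n_frames))
    speech = np.zeros_like(noisy_stft)
    for step in range(n_steps):
        noise_var = W @ H
        # 1) Speech estimate given the current noise variance.
        speech = speech_estimator(noisy_stft, noise_var, step)
        # 2) Noise parameters refit on the power of the residual.
        residual_power = np.abs(noisy_stft - speech) ** 2
        W, H = update_nmf_noise(residual_power, W, H)
    return speech, W @ H


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    noisy = rng.standard_normal((8, 12)) + 1j * rng.standard_normal((8, 12))
    speech, noise_var = enhance(noisy, wiener_placeholder, n_steps=5, rank=2)
    print(speech.shape, noise_var.shape)
```

Passing the speech estimator as a function keeps the noise update independent of how the speech estimate is produced, so the placeholder could be swapped for an actual diffusion sampler without touching the NMF code.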

Taxonomy

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Diffusion