Diffusion-based Unsupervised Audio-visual Speech Enhancement
Jean-Eudes Ayilo (MULTISPEECH), Mostafa Sadeghi (MULTISPEECH), Romain Serizel (MULTISPEECH), Xavier Alameda-Pineda (ROBOTLEARN)

TL;DR
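This paper introduces an unsupervised audio-visual speech enhancement (AVSE) method that pairs a video-conditioned diffusion model of clean speech with a non-negative matrix factorization (NMF) noise model. It outperforms its audio-only counterpart, generalizes better than a recent supervised-generative AVSE method, and offers a more efficient inference algorithm.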
Contribution
It presents a novel diffusion-based unsupervised AVSE approach that integrates video-conditioned speech generation with NMF noise modeling, enhancing performance and generalization.
Findings
Outperforms its audio-only counterpart.
Generalizes better than a recent supervised-generative AVSE method.
Offers a faster inference algorithm with comparable or improved performance.
Abstract
This paper proposes a new unsupervised audio-visual speech enhancement (AVSE) approach that combines a diffusion-based audio-visual speech generative model with a non-negative matrix factorization (NMF) noise model. First, the diffusion model is pre-trained on clean speech, conditioned on the corresponding video data, to model the distribution of clean speech. This pre-trained model is then paired with the NMF-based noise model to estimate clean speech iteratively. Specifically, a diffusion-based posterior sampling approach is implemented within the reverse diffusion process: after each iteration, a speech estimate is obtained and used to update the noise parameters. Experimental results confirm that the proposed AVSE approach not only outperforms its audio-only counterpart but also generalizes better than a recent supervised-generative AVSE method. Additionally, the new inference…
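The alternation described in the abstract (obtain a speech estimate, then refresh the NMF noise parameters from it) can be illustrated with a small sketch. This is not the authors' implementation. It assumes the noise power spectrogram is modeled as W @ H and fitted to the residual |mixture - speech estimate|^2 with the standard multiplicative updates for Itakura-Saito NMF, which is a common choice for variance models but is an assumption here. The names `update_noise_nmf`, `is_divergence`, and `toy_alternation` are hypothetical, and the reverse diffusion step is replaced by a placeholder that simply nudges the estimate toward a known synthetic target.

```python
import numpy as np

EPS = 1e-10  # guards against division by zero and log(0)


def is_divergence(P, V):
    """Itakura-Saito divergence between a power spectrogram P and a model V."""
    R = (P + EPS) / (V + EPS)
    return float(np.sum(R - np.log(R) - 1.0))


def update_noise_nmf(P, W, H, n_iter=1):
    """Multiplicative Itakura-Saito NMF updates so that W @ H approximates P.

    P: (F, T) non-negative residual power spectrogram.
    W: (F, K) non-negative spectral templates.
    H: (K, T) non-negative activations.
    """
    for _ in range(n_iter):
        V = W @ H + EPS
        W = W * (((P / V**2) @ H.T) / ((1.0 / V) @ H.T + EPS))
        V = W @ H + EPS
        H = H * ((W.T @ (P / V**2)) / (W.T @ (1.0 / V) + EPS))
    return W, H


def toy_alternation(seed=0, F=16, T=20, K=4, n_steps=30):
    """Toy loop: placeholder speech estimate, then NMF noise update.

    ASSUMPTION: the line marked 'placeholder' stands in for one reverse
    diffusion (posterior sampling) step of a pre-trained, video-conditioned
    model. It uses the synthetic ground truth, which a real system never has.
    """
    rng = np.random.default_rng(seed)
    speech = rng.normal(size=(F, T)) + 1j * rng.normal(size=(F, T))
    noise = 0.5 * (rng.normal(size=(F, T)) + 1j * rng.normal(size=(F, T)))
    mixture = speech + noise

    W = rng.uniform(0.1, 1.0, size=(F, K))
    H = rng.uniform(0.1, 1.0, size=(K, T))
    speech_est = np.zeros_like(mixture)
    history = []

    for _ in range(n_steps):
        # placeholder for a reverse diffusion step (hypothetical stand-in)
        speech_est = speech_est + 0.2 * (speech - speech_est)
        # residual power is what the noise model has to explain
        P = np.abs(mixture - speech_est) ** 2
        W, H = update_noise_nmf(P, W, H, n_iter=1)
        history.append(is_divergence(P, W @ H))
    return W, H, history
```

Calling `toy_alternation()` returns the fitted noise factors and the per-step divergence between the residual and the noise model. In a real system the placeholder line would be a sampling step guided by the mixture and the current noise variance W @ H.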
Taxonomy
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Diffusion
