AV2Wav: Diffusion-Based Re-synthesis from Continuous Self-supervised Features for Audio-Visual Speech Enhancement

Ju-Chieh Chou; Chung-Ming Chien; Karen Livescu

arXiv:2309.08030·eess.AS·November 5, 2024

TL;DR

AV2Wav is a resynthesis-based approach to audio-visual speech enhancement: a diffusion model trained on a nearly clean subset of a real-world audio-visual corpus generates waveforms conditioned on continuous AV-HuBERT representations. It outperforms a masking-based baseline in automatic metrics.

Contribution

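The paper trains a diffusion-based waveform generator on a nearly clean speech subset, selected by a neural quality estimator, and conditions it on continuous rather than discrete noise-robust AV-HuBERT representations so that prosody and speaker information are retained.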

Findings

01. Outperforms masking-based baseline in automatic metrics

02. Achieves near-target speech quality in listening tests

03. Effective in real-world noisy environments

Abstract

Speech enhancement systems are typically trained using pairs of clean and noisy speech. In audio-visual speech enhancement (AVSE), there is not as much ground-truth clean data available; most audio-visual datasets are collected in real-world environments with background noise and reverberation, hampering the development of AVSE. In this work, we introduce AV2Wav, a resynthesis-based audio-visual speech enhancement approach that can generate clean speech despite the challenges of real-world training data. We obtain a subset of nearly clean speech from an audio-visual corpus using a neural quality estimator, and then train a diffusion model on this subset to generate waveforms conditioned on continuous speech representations from AV-HuBERT with noise-robust training. We use continuous rather than discrete representations to retain prosody and speaker information. With this vocoding task…
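The data-selection step described in the abstract (keeping only nearly clean utterances according to a quality estimator) might look roughly like the sketch below. This is an illustration under stated assumptions, not the authors' code: `Utterance`, `select_nearly_clean`, `toy_score`, and the threshold value are all hypothetical, and the lookup-table scorer merely stands in for the neural quality estimator, which the abstract does not name.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Utterance:
    utt_id: str      # hypothetical utterance identifier
    audio_path: str  # hypothetical path to the audio track


def select_nearly_clean(
    utterances: Iterable[Utterance],
    score_fn: Callable[[Utterance], float],
    threshold: float,
) -> List[Utterance]:
    """Keep utterances whose estimated quality score is at least `threshold`."""
    return [u for u in utterances if score_fn(u) >= threshold]


# Toy stand-in for a neural quality estimator (assumption): a fixed lookup
# table of made-up scores, where higher means cleaner.
TOY_SCORES = {"utt_a": 4.2, "utt_b": 2.1, "utt_c": 3.6}


def toy_score(utt: Utterance) -> float:
    return TOY_SCORES[utt.utt_id]


corpus = [Utterance(utt_id, f"{utt_id}.wav") for utt_id in TOY_SCORES]

# The threshold here is arbitrary; a real cutoff would be tuned on the corpus.
subset = select_nearly_clean(corpus, toy_score, threshold=3.5)
print([u.utt_id for u in subset])
```

Passing the scorer in as a function keeps the filtering logic independent of whichever estimator is actually used; the selected subset would then serve as training data for the waveform generator.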

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Adaptive Filtering Techniques

Methods: Diffusion