LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders

Rodrigo Mira; Buye Xu; Jacob Donley; Anurag Kumar; Stavros Petridis; Vamsi Krishna Ithapu; Maja Pantic

arXiv:2211.10999 · cs.SD · March 14, 2023 · 1 cites

TL;DR

LA-VocE introduces a novel two-stage audio-visual speech enhancement method using transformers and neural vocoders, significantly improving speech quality in noisy environments across diverse speakers and languages.

Contribution

It is the first to combine transformer-based mel-spectrogram prediction with neural vocoders for low-SNR audio-visual speech enhancement.

Findings

1. Outperforms existing methods on multiple metrics
2. Effective in very noisy scenarios
3. Generalizes across speakers and languages

Abstract

Audio-visual speech enhancement aims to extract clean speech from a noisy environment by leveraging not only the audio itself but also the target speaker's lip movements. This approach has been shown to yield improvements over audio-only speech enhancement, particularly for the removal of interfering speech. Despite recent advances in speech synthesis, most audio-visual approaches continue to use spectral mapping/masking to reproduce the clean audio, often resulting in visual backbones added to existing speech enhancement architectures. In this work, we propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture, and then converts them into waveform audio using a neural vocoder (HiFi-GAN). We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability…
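To make the "low-SNR" setting in the title concrete: the signal-to-noise ratio in decibels is 10·log10(P_speech / P_noise), so a negative value means the noise carries more power than the speech. The following is a minimal sketch of how a noisy mixture at a chosen SNR can be constructed. It is not the authors' data pipeline; the function names `power`, `mix_at_snr`, and `measure_snr_db` are hypothetical, and plain Python lists stand in for audio arrays.

```python
import math


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, target_snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals
    `target_snr_db`, then add it to `speech` sample by sample.

    Illustrative helper only (an assumption, not from the paper).
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_power = power(noise)
    if noise_power == 0:
        raise ValueError("noise must not be silent")
    # SNR_dB = 10 * log10(P_speech / P_noise), solved for the noise power.
    target_noise_power = power(speech) / (10 ** (target_snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]


def measure_snr_db(speech, mixture):
    """Recover the SNR of a mixture, given the clean speech inside it."""
    residual = [m - s for s, m in zip(speech, mixture)]
    return 10 * math.log10(power(speech) / power(residual))


# Synthetic stand-ins for a speech clip and a noise clip.
speech = [math.sin(0.1 * i) for i in range(1000)]
noise = [math.cos(0.37 * i) for i in range(1000)]

# At -5 dB the scaled noise has about 3.2x the power of the speech.
mixture = mix_at_snr(speech, noise, -5.0)
```

An enhancement model sees only `mixture` (plus, in the audio-visual case, video of the speaker's lips) and has to recover `speech`. Lower target SNR values make that task harder.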

Taxonomy

Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Face recognition and analysis