Robust Unsupervised Audio-visual Speech Enhancement Using a Mixture of Variational Autoencoders
Mostafa Sadeghi, Xavier Alameda-Pineda

TL;DR
This paper introduces a robust unsupervised audio-visual speech enhancement technique that dynamically switches between audio-only and audio-visual VAEs to handle noisy visual data, improving speech quality in challenging conditions.
Contribution
It proposes a per-frame mixture model that combines a trained audio-only VAE with a trained audio-visual VAE, together with a variational EM algorithm, for robust speech enhancement.
Findings
The method skips noisy visual frames (for example, a non-frontal face or an occluded lip region) by switching to the audio-only VAE for those frames.
Experimental results demonstrate improved speech enhancement performance.
The approach outperforms existing models in noisy visual scenarios.
Abstract
Recently, an audio-visual speech generative model based on a variational autoencoder (VAE) has been proposed, which is combined with a nonnegative matrix factorization (NMF) model for the noise variance to perform unsupervised speech enhancement. When the visual data is clean, speech enhancement with the audio-visual VAE performs better than with the audio-only VAE, which is trained on audio-only data. However, the audio-visual VAE is not robust against noisy visual data, e.g., when, for some video frames, the speaker's face is not frontal or the lip region is occluded. In this paper, we propose a robust unsupervised audio-visual speech enhancement method based on a per-frame VAE mixture model. This mixture model consists of a trained audio-only VAE and a trained audio-visual VAE. The motivation is to skip noisy visual frames by switching to the audio-only VAE model. We present a variational…
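To make the per-frame switching idea concrete, here is a minimal sketch of how a two-component mixture could decide, frame by frame, which model to trust. It is not the authors' algorithm: the function names, the `prior_av` parameter, and the assumption that each VAE already supplies a per-frame log-likelihood are all hypothetical, introduced only for illustration.

```python
import numpy as np


def frame_responsibilities(loglik_a, loglik_av, prior_av=0.5):
    """Per-frame posterior probability that the audio-visual component is active.

    loglik_a, loglik_av: per-frame log-likelihoods under the audio-only and
    audio-visual models (hypothetical inputs, assumed to be given).
    prior_av: assumed prior probability of the audio-visual component.
    """
    loglik_a = np.asarray(loglik_a, dtype=float)
    loglik_av = np.asarray(loglik_av, dtype=float)
    # Joint log-probabilities of (component, observation) for each frame.
    log_joint_a = np.log1p(-prior_av) + loglik_a
    log_joint_av = np.log(prior_av) + loglik_av
    # Normalize in the log domain for numerical stability.
    log_norm = np.logaddexp(log_joint_a, log_joint_av)
    return np.exp(log_joint_av - log_norm)


def select_model(resp_av, threshold=0.5):
    """Hard per-frame choice: 'av' where the audio-visual model is favored, else 'a'."""
    return np.where(np.asarray(resp_av) > threshold, "av", "a")


# Toy example: the second frame has a poor audio-visual likelihood,
# as might happen when the lip region is occluded.
resp = frame_responsibilities([-10.0, -10.0], [-8.0, -20.0])
choice = select_model(resp)
```

In this toy setting the first frame is assigned to the audio-visual model and the second falls back to the audio-only model. A full method would also have to estimate the latent variables and the NMF noise parameters, which this sketch leaves out.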
