Improving Visual Speech Enhancement Network by Learning Audio-visual Affinity with Multi-head Attention
Xinmeng Xu, Yang Wang, Jie Jia, Binbin Chen, Dejun Li

TL;DR
This paper introduces a novel audio-visual speech enhancement network that employs multi-head cross attention to better fuse and balance audio-visual features, leading to improved speech enhancement performance.
Contribution
It proposes a multi-layer feature fusion model with a 2-stage multi-head cross attention mechanism for more effective audio-visual feature integration.
Findings
Outperforms state-of-the-art models in speech enhancement tasks.
Effective balancing of audio-visual features improves speech clarity.
Layer-by-layer fusion enhances feature utilization.
Abstract
Audio-visual speech enhancement is regarded as one of the promising solutions for isolating and enhancing the speech of a desired speaker. Typical methods focus on predicting the clean speech spectrum via a naive convolutional neural network based encoder-decoder architecture, and these methods a) do not make full use of the data, and b) are unable to effectively balance audio-visual features. The proposed model alleviates these drawbacks by a) fusing audio and visual features layer by layer in the encoding phase and feeding the fused audio-visual features to each corresponding decoder layer, and, more importantly, b) introducing a 2-stage multi-head cross attention (MHCA) mechanism for audio-visual speech enhancement that balances the fused audio-visual features and eliminates irrelevant features. This paper proposes attentional audio-visual multi-layer feature fusion model,…
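To make the cross-attention idea in the abstract concrete, here is a minimal sketch of a single multi-head cross attention step in plain Python. It is not the authors' implementation: it assumes audio frames act as queries and visual frames as keys and values, uses random matrices in place of learned weights, and shows only one attention stage, since this summary does not say how the paper's two stages are arranged. All function and parameter names (`multi_head_cross_attention`, `init_params`, `w_q`, and so on) are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap the rows and columns of a matrix."""
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    """Small random matrix, a stand-in for learned weights or real features."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def init_params(model_dim, num_heads, rng):
    """Hypothetical per-head projections plus one output projection."""
    head_dim = model_dim // num_heads
    per_head = lambda: [random_matrix(model_dim, head_dim, rng) for _ in range(num_heads)]
    return {"w_q": per_head(), "w_k": per_head(), "w_v": per_head(),
            "w_o": random_matrix(model_dim, model_dim, rng)}

def multi_head_cross_attention(audio, visual, params, num_heads):
    """Assumed direction: audio frames (queries) attend over visual frames (keys/values)."""
    head_dim = len(params["w_q"][0][0])
    head_outputs, head_weights = [], []
    for h in range(num_heads):
        q = matmul(audio, params["w_q"][h])    # (T_audio, head_dim)
        k = matmul(visual, params["w_k"][h])   # (T_visual, head_dim)
        v = matmul(visual, params["w_v"][h])   # (T_visual, head_dim)
        scores = matmul(q, transpose(k))       # (T_audio, T_visual)
        weights = [softmax([s / math.sqrt(head_dim) for s in row]) for row in scores]
        head_outputs.append(matmul(weights, v))
        head_weights.append(weights)
    # Concatenate the heads along the feature axis, then mix them with w_o.
    concat = [sum((head[t] for head in head_outputs), []) for t in range(len(audio))]
    return matmul(concat, params["w_o"]), head_weights

rng = random.Random(0)
audio = random_matrix(6, 8, rng)    # 6 audio frames, 8 features each (toy data)
visual = random_matrix(4, 8, rng)   # 4 video frames, 8 features each (toy data)
params = init_params(model_dim=8, num_heads=2, rng=rng)
fused, weights = multi_head_cross_attention(audio, visual, params, num_heads=2)
```

Each row of `weights[h]` is a probability distribution over the visual frames, so a frame that is irrelevant to a given audio frame can receive a weight close to zero. That is one plausible reading of how attention could "balance" the two modalities and suppress irrelevant features.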
Taxonomy
Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Image and Signal Denoising Methods
Methods: Convolution
