Improving Visual Speech Enhancement Network by Learning Audio-visual Affinity with Multi-head Attention

Xinmeng Xu; Yang Wang; Jie Jia; Binbin Chen; Dejun Li

arXiv:2206.14964·eess.AS·July 1, 2022·1 cites

Open Access

TL;DR

This paper introduces a novel audio-visual speech enhancement network that employs multi-head cross attention to better fuse and balance audio-visual features, leading to improved speech enhancement performance.

Contribution

It proposes a multi-layer feature fusion model with a 2-stage multi-head cross attention mechanism for more effective audio-visual feature integration.

Findings

1. Outperforms state-of-the-art models in speech enhancement tasks.

2. Effective balancing of audio-visual features improves speech clarity.

3. Layer-by-layer fusion enhances feature utilization.

Abstract

Audio-visual speech enhancement is regarded as one of the promising solutions for isolating and enhancing the speech of a desired speaker. Typical methods focus on predicting the clean speech spectrum via a naive convolutional neural network based encoder-decoder architecture, and these methods a) do not make full use of the data, and b) are unable to effectively balance audio-visual features. The proposed model alleviates these drawbacks by a) fusing audio and visual features layer by layer in the encoding phase and feeding the fused audio-visual features to each corresponding decoder layer, and, more importantly, b) introducing a 2-stage multi-head cross attention (MHCA) mechanism for audio-visual speech enhancement, which balances the fused audio-visual features and eliminates irrelevant features. This paper proposes an attentional audio-visual multi-layer feature fusion model,…
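To make the cross-attention idea concrete, the following is a minimal NumPy sketch of a single generic multi-head cross attention step in which audio frames act as queries over visual frames. It is not the paper's 2-stage MHCA design, whose details are not given in this summary. The function name, the `params` dictionary, the projection matrices `W_q`, `W_k`, `W_v`, `W_o`, and all dimensions are hypothetical choices made only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_cross_attention(audio, visual, params, num_heads):
    """Audio frames (queries) attend over visual frames (keys and values).

    audio:  array of shape (T_a, d_model)
    visual: array of shape (T_v, d_model)
    params: hypothetical projection matrices W_q, W_k, W_v, W_o,
            each of shape (d_model, d_model)
    """
    t_a, d_model = audio.shape
    t_v = visual.shape[0]
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads

    def split_heads(x, t):
        # (t, d_model) -> (num_heads, t, d_head)
        return x.reshape(t, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(audio @ params["W_q"], t_a)
    k = split_heads(visual @ params["W_k"], t_v)
    v = split_heads(visual @ params["W_v"], t_v)

    # Scaled dot-product scores, shape (num_heads, T_a, T_v).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)

    # Weighted sum of visual values, then merge heads back to (T_a, d_model).
    context = (weights @ v).transpose(1, 0, 2).reshape(t_a, d_model)
    return context @ params["W_o"], weights


# Toy example with random features (assumed sizes, not from the paper).
rng = np.random.default_rng(0)
d_model, num_heads = 16, 4
params = {name: rng.standard_normal((d_model, d_model)) * 0.1
          for name in ("W_q", "W_k", "W_v", "W_o")}
audio = rng.standard_normal((10, d_model))   # 10 audio frames
visual = rng.standard_normal((5, d_model))   # 5 video frames
fused, weights = multi_head_cross_attention(audio, visual, params, num_heads)
```

Each row of `weights` sums to 1 over the visual frames, so every audio frame receives a convex combination of visual values per head. A weighting of this kind is one plausible way to down-weight visual frames that are irrelevant to a given audio frame.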

Taxonomy

Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Image and Signal Denoising Methods

Methods: Convolution