MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition

He Wang; Pengcheng Guo; Pan Zhou; Lei Xie

arXiv:2401.03424 · cs.SD · April 9, 2024 · 1 citation

Open Access

TL;DR

This paper introduces MLCA-AVSR, an audio-visual speech recognition method that fuses audio and visual features with cross-attention at multiple encoder levels, and reports a state-of-the-art cpCER of 29.13% on the MISP2022-AVSR Challenge dataset.

Contribution

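The paper proposes multi-layer cross-attention fusion: instead of fusing only the outputs of modality-specific encoders, it fuses audio and visual features at different levels of the audio/visual encoders, so that each modality's representation learning can draw on the other and recognition becomes more robust to noise.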

Findings

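01. Achieved a concatenated minimum permutation character error rate (cpCER) of 30.57% on the MISP2022-AVSR Eval set.

02. Yielded up to 3.17% relative cpCER improvement over the authors' previous system.

03. Established a new state-of-the-art cpCER of 29.13% on the dataset after fusing multiple systems.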
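The findings above are all stated in cpCER. As a rough illustration of what that metric measures, the sketch below concatenates each speaker's utterances, scores every assignment of hypothesis speakers to reference speakers by character-level edit distance, and keeps the assignment with the fewest errors. The function names (`edit_distance`, `cp_cer`) and the toy transcripts are hypothetical, and text normalization rules used by the official challenge scoring are not reproduced here.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cp_cer(ref_by_speaker, hyp_by_speaker):
    """Concatenated minimum-permutation CER for one session.

    Each argument is a list with one entry per speaker; each entry is
    that speaker's list of utterance strings in time order.
    """
    refs = ["".join(utts) for utts in ref_by_speaker]
    hyps = ["".join(utts) for utts in hyp_by_speaker]

    # Pad with empty transcripts so both sides have the same speaker count.
    n = max(len(refs), len(hyps))
    refs += [""] * (n - len(refs))
    hyps += [""] * (n - len(hyps))

    total_ref_chars = sum(len(r) for r in refs)
    if total_ref_chars == 0:
        raise ValueError("reference transcripts are empty")

    # Try every speaker assignment and keep the one with the fewest errors.
    best_errors = min(
        sum(edit_distance(r, hyps[p]) for r, p in zip(refs, perm))
        for perm in permutations(range(n))
    )
    return best_errors / total_ref_chars


# Toy example: the hypothesis lists the two speakers in swapped order
# and contains one wrong character.
reference = [["ab", "cd"], ["efg"]]
hypothesis = [["efx"], ["abcd"]]
score = cp_cer(reference, hypothesis)  # 1 error over 7 reference characters
```

Because all speaker permutations are enumerated, this brute-force form is only practical for a handful of speakers per session; scoring tools typically solve the assignment step more efficiently.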

Abstract

While automatic speech recognition (ASR) systems degrade significantly in noisy environments, audio-visual speech recognition (AVSR) systems aim to complement the audio stream with noise-invariant visual cues and improve the system's robustness. However, current studies mainly focus on fusing the well-learned modality features, like the output of modality-specific encoders, without considering the contextual relationship during the modality feature learning. In this study, we propose a multi-layer cross-attention fusion based AVSR (MLCA-AVSR) approach that promotes representation learning of each modality by fusing them at different levels of audio/visual encoders. Experimental results on the MISP2022-AVSR Challenge dataset show the efficacy of our proposed system, achieving a concatenated minimum permutation character error rate (cpCER) of 30.57% on the Eval set and yielding up to…
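To make the idea of fusing at different encoder levels concrete, here is a minimal sketch in plain Python. It is not the authors' architecture: it assumes single-head scaled dot-product cross-attention without learned projections, a residual connection, both streams being updated at every level, and placeholder encoder layers. The names `cross_attend` and `fuse_multi_layer` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Let every query frame attend over all context frames.

    Returns the query frames plus the attended context (residual add).
    Both inputs are lists of equal-dimension feature vectors.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in context]
        weights = softmax(scores)
        attended = [sum(w * v[i] for w, v in zip(weights, context))
                    for i in range(dim)]
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused


def fuse_multi_layer(audio, visual, audio_layers, visual_layers):
    """Run paired encoder layers and fuse the two streams after each pair.

    Assumption: fusion is applied after every layer pair and in both
    directions; the paper may place its fusion modules differently.
    """
    for audio_layer, visual_layer in zip(audio_layers, visual_layers):
        audio = audio_layer(audio)
        visual = visual_layer(visual)
        # The right-hand side is evaluated first, so each stream attends
        # to the other stream's pre-fusion features.
        audio, visual = cross_attend(audio, visual), cross_attend(visual, audio)
    return audio, visual


# Placeholder "encoder layers" (identity) standing in for real encoder blocks.
identity = lambda frames: frames

audio_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # 3 audio frames
visual_feats = [[0.2, 0.8], [0.9, 0.1]]             # 2 video frames
fused_audio, fused_visual = fuse_multi_layer(
    audio_feats, visual_feats, [identity, identity], [identity, identity]
)
```

Each stream keeps its own frame count, so the audio and video sequences do not need to be aligned to the same rate before fusion. By contrast, fusing only the final encoder outputs would correspond to calling `cross_attend` once after the loop, which is the late-fusion setup the abstract argues against.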

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing

Methods: Sparse Evolutionary Training · Focus