MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition
He Wang, Pengcheng Guo, Pan Zhou, Lei Xie

TL;DR
This paper introduces MLCA-AVSR, a multi-layer cross-attention fusion method for audio-visual speech recognition that enhances modality representation learning at multiple levels, leading to state-of-the-art results in noisy environments.
Contribution
It proposes a novel multi-layer cross-attention fusion approach that integrates audio and visual features at different encoder levels for improved speech recognition robustness.
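The idea of fusing the two modalities at several encoder depths, rather than only at the encoder outputs, can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the number of fusion levels, the feature dimension, the random projection weights, the placeholder `encoder_block`, and the residual-add fusion are all hypothetical choices made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query, context, w_q, w_k, w_v):
    """Single-head scaled dot-product attention: `query` attends to `context`."""
    q = query @ w_q                      # (T_q, d)
    k = context @ w_k                    # (T_c, d)
    v = context @ w_v                    # (T_c, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v  # (T_q, d)


def encoder_block(x, w):
    """Hypothetical stand-in for a real encoder block."""
    return x + np.tanh(x @ w)


def multi_layer_fusion(audio, video, num_levels=3, seed=0):
    """Interleave per-modality encoding with cross-modal fusion at each level.

    audio: (T_a, d) array, video: (T_v, d) array; T_a and T_v may differ.
    Weights are random here (assumption); a real model would learn them.
    """
    rng = np.random.default_rng(seed)
    d = audio.shape[-1]

    def init():
        return rng.normal(scale=d ** -0.5, size=(d, d))

    for _ in range(num_levels):
        # Modality-specific encoding at this level.
        audio = encoder_block(audio, init())
        video = encoder_block(video, init())
        # Each modality queries the other, using the pre-fusion features.
        audio_from_video = cross_attention(audio, video, init(), init(), init())
        video_from_audio = cross_attention(video, audio, init(), init(), init())
        # Residual fusion feeds the next level of both encoders.
        audio = audio + audio_from_video
        video = video + video_from_audio
    return audio, video
```

Because fusion happens inside the loop, the visual stream influences the audio representation from the first level onward, which is the contrast with fusing only the final encoder outputs.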
Findings
Achieved a cpCER of 30.57% on the MISP2022-AVSR Eval set.
Yielded up to a 3.17% relative cpCER improvement over the authors' previous system.
After multi-system fusion, established a new state-of-the-art cpCER of 29.13% on this dataset.
Abstract
While automatic speech recognition (ASR) systems degrade significantly in noisy environments, audio-visual speech recognition (AVSR) systems aim to complement the audio stream with noise-invariant visual cues and improve the system's robustness. However, current studies mainly focus on fusing the well-learned modality features, like the output of modality-specific encoders, without considering the contextual relationship during the modality feature learning. In this study, we propose a multi-layer cross-attention fusion based AVSR (MLCA-AVSR) approach that promotes representation learning of each modality by fusing them at different levels of audio/visual encoders. Experimental results on the MISP2022-AVSR Challenge dataset show the efficacy of our proposed system, achieving a concatenated minimum permutation character error rate (cpCER) of 30.57% on the Eval set and yielding up to…
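The cpCER metric named above can be illustrated with a short sketch: concatenate each speaker's utterances, try every assignment of hypothesis speakers to reference speakers, and keep the assignment with the fewest character errors. The function names and the input layout (one list of utterance strings per speaker) are assumptions for illustration; the challenge's official scoring tool may apply text normalization that is not reproduced here.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def cpcer(ref_by_speaker, hyp_by_speaker):
    """Concatenated minimum-permutation character error rate for one session."""
    refs = ["".join(utts) for utts in ref_by_speaker]
    hyps = ["".join(utts) for utts in hyp_by_speaker]
    # Pad with empty speakers so both sides have the same number of streams.
    n = max(len(refs), len(hyps))
    refs += [""] * (n - len(refs))
    hyps += [""] * (n - len(hyps))
    total_chars = sum(len(r) for r in refs)
    if total_chars == 0:
        raise ValueError("reference contains no characters")
    # Brute force over speaker assignments; fine for a handful of speakers.
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_chars


# The hypothesis speakers are listed in a different order than the reference.
refs = [["abcd", "ef"], ["xyz"]]
hyps = [["xyz"], ["abcd", "eg"]]
print(cpcer(refs, hyps))  # 1 error over 9 reference characters
```

The permutation search is what makes the metric insensitive to how the system labels speakers: only the best matching between hypothesis and reference speakers is scored.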
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
Methods: Sparse Evolutionary Training · Focus
