Robust LLM-based Audio-Visual Speech Recognition with Sparse Modality Alignment and Visual Unit-Guided Refinement
Fei Su, Cancan Li, Juan Liu, Wei Ju, Hongbin Suo, Ming Li

TL;DR
This paper introduces AVUR-LLM, a novel approach for audio-visual speech recognition that uses sparse modality alignment and visual unit-guided refinement to improve robustness and performance, especially in noisy environments.
Contribution
It proposes a new LLM-based AVSR method with sparse modality alignment and visual unit-guided refinement, addressing limitations of prior shallow fusion approaches.
Findings
Achieves state-of-the-art AVSR results on the LRS3 dataset.
Achieves a 37% relative improvement over the baseline system under additive noise at 0 dB SNR.
Demonstrates effective cross-modal alignment and robustness in adverse conditions.
Abstract
Audio-Visual Speech Recognition (AVSR) integrates acoustic and visual information to enhance robustness in adverse acoustic conditions. Recent advances in Large Language Models (LLMs) have yielded competitive automatic speech recognition performance and shown effectiveness for AVSR. However, prior approaches project audio and visual features independently or apply shallow fusion, limiting cross-modal alignment and complementary exchange while increasing the LLM's computational load. To address this, we propose AVUR-LLM, an LLM-based approach to Audio-Visual Speech Recognition via Sparse Modality Alignment and Visual Unit-Guided Refinement. Experiments on LRS3 demonstrate state-of-the-art results for AVSR. Under additive-noise conditions at 0 dB SNR, AVUR-LLM achieves a 37% relative improvement over the baseline system.
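To make the "0 dB SNR" and "37% relative improvement" figures concrete, the following is a minimal sketch of how additive noise is commonly mixed at a target SNR and how a relative improvement is computed. It is not the authors' evaluation code. The function names (`mix_at_snr`, `relative_improvement`), the synthetic sine "speech" and Gaussian noise, and the example error rates are all hypothetical. The sketch also assumes the improvement is measured as a relative reduction in word error rate, which the abstract does not state explicitly.

```python
import math
import random


def power(signal):
    """Mean power (average squared amplitude) of a signal."""
    return sum(v * v for v in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it.

    Returns the noisy mixture and the scale factor applied to the noise.
    """
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / power(noise))
    mixture = [s + scale * n for s, n in zip(speech, noise)]
    return mixture, scale


def relative_improvement(baseline_wer, system_wer):
    """Relative error reduction of a system over a baseline (assumed metric)."""
    return (baseline_wer - system_wer) / baseline_wer


# Synthetic stand-ins for real audio (hypothetical data, 1 second at 16 kHz).
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0, 1) for _ in range(16000)]

# At 0 dB SNR the scaled noise carries the same power as the speech.
noisy, scale = mix_at_snr(speech, noise, snr_db=0.0)
scaled_noise = [scale * n for n in noise]
measured_snr_db = 10 * math.log10(power(speech) / power(scaled_noise))

# Hypothetical error rates, NOT taken from the paper: a drop from 10.0 to 6.3
# would correspond to a 37% relative improvement.
example_gain = relative_improvement(10.0, 6.3)
```

At 0 dB the noise is as strong as the speech itself, which is the regime where the visual stream is expected to contribute most.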
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
