Robust LLM-based Audio-Visual Speech Recognition with Sparse Modality Alignment and Visual Unit-Guided Refinement

Fei Su; Cancan Li; Juan Liu; Wei Ju; Hongbin Suo; Ming Li

arXiv:2603.03811·cs.SD·March 5, 2026

Open Access

TL;DR

This paper introduces AVUR-LLM, an LLM-based approach to audio-visual speech recognition (AVSR) that uses sparse modality alignment and visual unit-guided refinement to improve recognition performance and robustness, particularly under acoustic noise.

Contribution

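The paper proposes an LLM-based AVSR method built on sparse modality alignment and visual unit-guided refinement. It targets two limitations of prior approaches that project audio and visual features independently or fuse them shallowly: weak cross-modal alignment and a higher computational load on the LLM.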

Findings

01. Achieves state-of-the-art AVSR results on the LRS3 dataset.

02. Yields a 37% relative improvement over the baseline system under additive noise at 0 dB SNR.

03. Demonstrates effective cross-modal alignment and robustness in adverse acoustic conditions.

Abstract

Audio-Visual Speech Recognition (AVSR) integrates acoustic and visual information to enhance robustness in adverse acoustic conditions. Recent advances in Large Language Models (LLMs) have yielded competitive automatic speech recognition performance and shown effectiveness for AVSR. However, prior approaches project audio and visual features independently or apply shallow fusion, limiting cross-modal alignment and complementary exchange while increasing the LLM's computational load. To address this, we propose AVUR-LLM, an LLM-based Audio-Visual Speech Recognition framework with Sparse Modality Alignment and Visual Unit-Guided Refinement. Experiments on LRS3 demonstrate state-of-the-art results for AVSR. Under additive-noise conditions at 0 dB SNR, it achieves a 37% relative improvement over the baseline system.
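
To make the two quantities in the last sentence concrete, here is a minimal sketch of what "additive noise at 0 dB SNR" and "relative improvement" usually mean. It is not the authors' evaluation code. The function names `mix_at_snr` and `relative_improvement` are hypothetical, the signals are synthetic, and the example error rates are invented for illustration and are not taken from the paper. The abstract does not name the metric, so the sketch assumes a lower-is-better error rate.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power matches `snr_db`, then add it.

    Returns the noisy signal and the gain applied to the noise.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("both signals must have non-zero power")
    # Target noise power is speech_power / 10^(snr_db / 10).
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    mixed = [s + gain * n for s, n in zip(speech, noise)]
    return mixed, gain


def relative_improvement(baseline_error, system_error):
    """Fraction of the baseline error removed (assumes lower error is better)."""
    return (baseline_error - system_error) / baseline_error


# Synthetic stand-ins for a speech waveform and a noise recording.
rng = random.Random(0)
speech = [math.sin(0.01 * i) for i in range(1000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# At 0 dB SNR the scaled noise carries the same average power as the speech.
noisy, gain = mix_at_snr(speech, noise, snr_db=0.0)

# Invented error rates, chosen only to show how a 37% figure would arise.
print(relative_improvement(baseline_error=10.0, system_error=6.3))
```

At 0 dB the noise is as strong as the speech, which is why visual cues matter in this regime. Note that a relative improvement is a fraction of the baseline's error, not an absolute difference in percentage points.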

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing