Improving Noise Robust Audio-Visual Speech Recognition via Router-Gated Cross-Modal Feature Fusion

DongHoon Lim; YoungChae Kim; Dong-Hyun Kim; Da-Hee Yang; Joon-Hyuk Chang

arXiv:2508.18734 · cs.CV · August 27, 2025

TL;DR

This paper introduces an audio-visual speech recognition (AVSR) framework that adaptively combines audio and visual features through router-gated cross-modal fusion. In experiments on LRS3, it reduces word error rate by 16.51-42.67% relative to AV-HuBERT.

Contribution

The paper proposes router-gated cross-modal feature fusion, a method that dynamically reweights audio and visual cues according to token-level acoustic corruption scores, improving robustness in noisy conditions.

Findings

01

Achieves a 16.51-42.67% relative WER reduction over AV-HuBERT on LRS3

02

Effectively down-weights unreliable audio tokens in noisy settings

03

Ablations show that both the router and the gating mechanism contribute to robustness
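
A relative WER reduction, as reported in the first finding, is the drop in WER divided by the baseline WER. The snippet below is a minimal sketch of that arithmetic, assuming the standard definition; the function name and the example numbers are hypothetical and are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    baseline_wer: WER of the reference system (e.g. a baseline model).
    new_wer: WER of the system being evaluated.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers, chosen only to illustrate the formula.
print(relative_wer_reduction(10.0, 8.0))  # 20.0
```

Because the reduction is relative, the same percentage can correspond to very different absolute WER changes depending on the baseline.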

Abstract

Robust audio-visual speech recognition (AVSR) in noisy environments remains challenging, as existing systems struggle to estimate audio reliability and dynamically adjust modality reliance. We propose router-gated cross-modal feature fusion, a novel AVSR framework that adaptively reweights audio and visual features based on token-level acoustic corruption scores. Using an audio-visual feature fusion-based router, our method down-weights unreliable audio tokens and reinforces visual cues through gated cross-attention in each decoder layer. This enables the model to pivot toward the visual modality when audio quality deteriorates. Experiments on LRS3 demonstrate that our approach achieves a 16.51-42.67% relative reduction in word error rate compared to AV-HuBERT. Ablation studies confirm that both the router and gating mechanism contribute to improved robustness under real-world acoustic…
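
The mechanism described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not specify the router architecture or the form of the gate, so everything here is an assumption made for illustration. The router is a single linear layer over concatenated audio and visual features, the audio tokens are scaled by one minus the corruption score, and the visual gate is simply the mean corruption score. All function and parameter names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def router_scores(audio, visual, w_router):
    """Token-level corruption scores in [0, 1] from fused features.

    audio, visual: (T, D) arrays; w_router: (2 * D,) weights (assumed linear router).
    """
    fused = np.concatenate([audio, visual], axis=-1)  # (T, 2D)
    return sigmoid(fused @ w_router)                  # (T,)


def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    d = query.shape[-1]
    attn = softmax(query @ keys.T / np.sqrt(d), axis=-1)  # (U, T)
    return attn @ values                                  # (U, D)


def gated_fusion_layer(dec_state, audio, visual, w_router):
    """One decoder-layer fusion step (illustrative assumption, see lead-in)."""
    c = router_scores(audio, visual, w_router)       # high score = unreliable audio
    audio_rw = (1.0 - c)[:, None] * audio            # down-weight corrupted audio tokens
    a_ctx = cross_attention(dec_state, audio_rw, audio_rw)
    v_ctx = cross_attention(dec_state, visual, visual)
    gate = c.mean()                                  # assumed gate: more corruption, more vision
    return dec_state + a_ctx + gate * v_ctx


# Usage with random features: T audio/visual tokens, U decoder positions.
rng = np.random.default_rng(0)
T, U, D = 5, 3, 8
audio = rng.normal(size=(T, D))
visual = rng.normal(size=(T, D))
dec_state = rng.normal(size=(U, D))
w_router = rng.normal(size=(2 * D,))
out = gated_fusion_layer(dec_state, audio, visual, w_router)
print(out.shape)  # (3, 8)
```

In this sketch, when every corruption score approaches 1 the reweighted audio tokens approach zero and the gate approaches 1, so the layer output is driven almost entirely by the visual context. That matches the behavior the abstract describes as pivoting toward the visual modality when audio quality deteriorates.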