Noise-Robust AV-ASR Using Visual Features Both in the Whisper Encoder and Decoder

Zhengyang Li; Thomas Graave; Björn Möller; Zehang Wu; Matthias Franz; Tim Fingscheidt

arXiv:2601.18396 · eess.AS · January 27, 2026

TL;DR

This paper introduces a dual-use visual fusion method for Whisper-based AV-ASR models, in which visual features are fed to both the encoder and the decoder. The method significantly improves noise robustness and establishes new state-of-the-art results in noisy conditions.

Contribution

Proposes a dual-use visual fusion approach for Whisper models that uses visual features in both the encoder and the decoder, improving noise robustness over the typical middle-fusion reference in AV-ASR.

Findings

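01. 35% relative WER reduction with Whisper small (4.41% vs. 6.83%) over the middle-fusion reference in babble noise at 0 dB SNR

02. 57% relative WER reduction with Whisper medium (4.07% vs. 9.53%) over the middle-fusion reference in babble noise at 0 dB SNR

03. Achieves state-of-the-art results on the LRS3 benchmark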

Abstract

In audiovisual automatic speech recognition (AV-ASR) systems, fusing visual features into a pre-trained ASR model has proven to be a promising way to improve noise robustness. In this work, based on the prominent Whisper ASR, first, we propose a simple and effective visual fusion method, the use of visual features in both the encoder and the decoder (dual-use), to learn the audiovisual interactions in the encoder and to weigh modalities in the decoder. Second, we compare visual fusion methods in Whisper models of various sizes. Our proposed dual-use method shows consistent noise robustness improvement, e.g., a 35% relative improvement (WER: 4.41% vs. 6.83%) based on Whisper small, and a 57% relative improvement (WER: 4.07% vs. 9.53%) based on Whisper medium, compared to typical reference middle fusion in babble noise with a signal-to-noise ratio (SNR) of 0 dB. Third, we conduct…
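
The relative improvements quoted in the abstract can be checked against the absolute WERs. The following is a minimal sketch, assuming "relative improvement" means (baseline WER - proposed WER) / baseline WER, with middle fusion as the baseline. The function name `relative_wer_reduction` is illustrative and does not come from the paper.

```python
def relative_wer_reduction(baseline_wer: float, proposed_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Assumption: the reduction is measured against the baseline WER,
    i.e. 100 * (baseline - proposed) / baseline.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - proposed_wer) / baseline_wer


# WERs (in %) from the abstract: babble noise, SNR of 0 dB.
# Baseline = middle fusion, proposed = dual-use fusion.
small = relative_wer_reduction(baseline_wer=6.83, proposed_wer=4.41)
medium = relative_wer_reduction(baseline_wer=9.53, proposed_wer=4.07)

print(f"Whisper small:  {small:.1f}% relative WER reduction")
print(f"Whisper medium: {medium:.1f}% relative WER reduction")
```

Under this assumption the values round to the 35% and 57% stated in the abstract.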

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing