Purification Before Fusion: Toward Mask-Free Speech Enhancement for Robust Audio-Visual Speech Recognition

Linzhi Wu; Xingyu Zhang; Hao Yuan; Yakun Zhang; Changyan Zheng; Liang Xie; Tiejun Liu; Erwei Yin

arXiv:2601.12436 · eess.AS · March 9, 2026

Open Access

TL;DR

This paper introduces a mask-free, end-to-end audio-visual speech recognition framework that enhances robustness in noisy environments by implicitly refining audio features with video assistance, outperforming mask-based methods.

Contribution

The proposed framework eliminates the need for explicit noise masks. It improves noise robustness by using a Conformer-based bottleneck fusion module to implicitly refine noisy audio features with video assistance.

Findings

01  Outperforms mask-based baselines on the LRS3 benchmark in noisy conditions.

02  Preserves speech semantics while reducing noise interference.

03  Remains robust without any explicit noise-masking strategy.
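
For readers who want to reproduce "noisy conditions" in their own experiments, the sketch below mixes a noise signal into clean speech at a chosen signal-to-noise ratio (SNR). This listing does not state the paper's noise protocol, so the function name `mix_at_snr`, the synthetic sine-wave "speech", the Gaussian noise, and the 0 dB setting are all illustrative assumptions rather than the authors' setup.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power equals `snr_db`, then add it.

    Hypothetical helper for illustration; not taken from the paper.
    """
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # SNR(dB) = 10 * log10(P_speech / P_noise)  =>  P_noise = P_speech / 10^(SNR/10)
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    scale = np.sqrt(target_noise_power / noise_power)
    return speech + scale * noise


# Synthetic stand-ins: a 220 Hz tone as "speech" and white noise, 1 s at 16 kHz.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000
speech = np.sin(2 * np.pi * 220 * t)
noise = rng.normal(size=16000)

noisy = mix_at_snr(speech, noise, snr_db=0.0)

# Recover the SNR from the mixture to confirm the scaling did what was intended.
measured_snr_db = 10 * np.log10(np.mean(speech ** 2) / np.mean((noisy - speech) ** 2))
print(round(measured_snr_db, 6))
```

Lower `snr_db` values give harsher conditions; sweeping it over a range is a common way to compare mask-based and mask-free systems under matched noise levels.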

Abstract

Audio-visual speech recognition (AVSR) typically improves recognition accuracy in noisy environments by integrating noise-immune visual cues with audio signals. Nevertheless, high-noise audio inputs are prone to introducing adverse interference into the feature fusion process. To mitigate this, recent AVSR methods often adopt mask-based strategies to filter audio noise during feature interaction and fusion, yet such methods risk discarding semantically relevant information alongside noise. In this work, we propose an end-to-end noise-robust AVSR framework coupled with speech enhancement, eliminating the need for explicit noise mask generation. This framework leverages a Conformer-based bottleneck fusion module to implicitly refine noisy audio features with video assistance. By reducing modality redundancy and enhancing inter-modal interactions, our method preserves speech semantic…
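
The idea of refining audio through a small shared bottleneck, rather than gating it with a mask, can be sketched in a few lines. This is a minimal illustration under strong assumptions, not the authors' module: it uses plain single-head attention without learned projections instead of Conformer blocks, and the names `attend`, `bottleneck_refine`, the token counts, and the feature size are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(queries, context):
    """Single-head scaled dot-product attention with no learned projections."""
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d))
    return weights @ context


def bottleneck_refine(audio, video, bottleneck):
    """Refine audio features using video information routed through a few tokens.

    Assumed simplification: the bottleneck first reads from video, then audio
    frames attend over themselves plus the bottleneck. No mask is computed, and
    the residual connection keeps every audio frame's original content.
    """
    # Step 1: the bottleneck tokens summarize the video stream.
    b = bottleneck + attend(bottleneck, video)
    # Step 2: audio sees video only through the narrow bottleneck.
    context = np.concatenate([audio, b], axis=0)
    refined = audio + attend(audio, context)
    return refined, b


rng = np.random.default_rng(0)
audio = rng.normal(size=(50, 16))      # 50 noisy audio frames, 16-dim features
video = rng.normal(size=(25, 16))      # 25 video frames, 16-dim features
bottleneck = rng.normal(size=(4, 16))  # 4 bottleneck tokens

refined, b = bottleneck_refine(audio, video, bottleneck)
print(refined.shape, b.shape)  # (50, 16) (4, 16)
```

Because cross-modal information must pass through only a handful of tokens, redundant detail is compressed away, while the residual path means no audio frame is ever zeroed out the way a hard mask could.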

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Emotion and Mood Recognition