Cross-Modal Bottleneck Fusion For Noise Robust Audio-Visual Speech Recognition

Seaone Ok; Min Jun Choi; Eungbeom Kim; Seungu Han; Kyogu Lee

arXiv:2602.08293 · eess.AS · February 10, 2026

Open Access

TL;DR

This paper introduces CoBRA, a novel cross-modal fusion framework for audio-visual speech recognition that enhances noise robustness by using learnable tokens to regulate information exchange between modalities.

Contribution

The paper proposes a bottleneck-based fusion method with learnable tokens, improving noise robustness and efficiency in AVSR systems, especially with limited training data.

Findings

01 Outperforms baseline models in noisy conditions

02 Maintains competitive performance with large-scale systems

03 Identifies fusion depth as a critical factor

Abstract

Audio-Visual Speech Recognition (AVSR) leverages both acoustic and visual cues to improve speech recognition under noisy conditions. A central question is how to design a fusion mechanism that allows the model to effectively exploit visual information when the audio signal is degraded, while maintaining strong performance on clean speech. We propose CoBRA (Cross-modal Bottleneck for Robust AVSR), a bottleneck-based fusion framework that introduces a compact set of learnable tokens to mediate cross-modal exchange. By regulating information flow through these tokens, the audio stream can reliably access essential visual cues even under adverse or out-of-domain noise. Despite limited training data, our model surpasses comparable baselines and remains competitive with large-scale systems through noise-adaptive fusion, demonstrating both efficiency and robustness. Ablation studies highlight…
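
To make the bottleneck idea in the abstract concrete, here is a minimal sketch of token-mediated fusion. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the architecture, so the two-step ordering (tokens read from video, then audio reads from tokens), the single-head attention without learned projections, the residual updates, and every name and shape below (`attend`, `bottleneck_fusion`, 4 tokens, 16 dimensions) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attend(queries, keys_values):
    """Single-head scaled dot-product attention.

    Assumption: no learned query/key/value projections, to keep the
    sketch short. A real model would include them.
    """
    dim = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(dim)
    return softmax(scores, axis=-1) @ keys_values


def bottleneck_fusion(audio, video, tokens):
    """Route visual information to the audio stream through a few tokens.

    Hypothetical two-step scheme: the audio frames never attend to the
    video frames directly, only to the compact token set.
    """
    # Step 1: the bottleneck tokens summarize the video sequence.
    tokens = tokens + attend(tokens, video)
    # Step 2: the audio stream reads only from the updated tokens.
    fused_audio = audio + attend(audio, tokens)
    return fused_audio, tokens


rng = np.random.default_rng(0)
audio = rng.normal(size=(50, 16))   # 50 audio frames (illustrative)
video = rng.normal(size=(25, 16))   # 25 video frames (illustrative)
tokens = rng.normal(size=(4, 16))   # 4 bottleneck tokens; learnable in a real model

fused_audio, updated_tokens = bottleneck_fusion(audio, video, tokens)
print(fused_audio.shape, updated_tokens.shape)  # (50, 16) (4, 16)
```

In this sketch the token count caps how much visual information can reach the audio stream in one fusion step, which is one way to read "regulating information flow" in the abstract. In a trained model the tokens would be parameters updated by gradient descent rather than fixed random draws.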

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Emotion and Mood Recognition