Seeing the Context: Rich Visual Context-Aware Speech Recognition via Multimodal Reasoning

Wenjie Tian; Mingchen Shao; Bingshen Mu; Xuelong Geng; Chengyou Wang; Yujie Liao; Zhixian Zhao; Ziyu Zhang; Jingbin Hu; Mengqi Wei; Lei Xie

arXiv:2603.07263·cs.SD·March 10, 2026

Open Access

TL;DR

This paper introduces VASR, a multimodal reasoning framework for audio-visual speech recognition that uses rich visual context and explicit reasoning to improve recognition accuracy and reduce over-reliance on a single modality.

Contribution

It proposes AV-CoT, an Audio-Visual Chain-of-Thought that explicitly enforces intermediate cross-modal grounding between acoustic signals and visual evidence, and it releases a data pipeline and test set for context-aware AVSR.

Findings

01 AV-CoT mitigates single-modality dominance effectively.

02 Achieves state-of-the-art results in context-aware AVSR.

03 Open-sourced data pipeline and test set for further research.

Abstract

Audio-visual speech recognition (AVSR) extends ASR by incorporating visual signals. Current AVSR approaches focus primarily on lip motion and largely overlook the rich context present in the video, such as the speaking scene and on-screen text. To tackle this setting, which we call CAVSR (AVSR including rich visual Context), we propose VASR, which is designed to "see" and reason about the visual context to improve speech recognition. Specifically, we construct an Audio-Visual Chain-of-Thought (AV-CoT) that explicitly enforces intermediate cross-modal grounding between acoustic signals and visual evidence. This evidence-driven reasoning mitigates the "single-modality dominance" problem, where models either over-rely on visual context or fail to utilize it. In addition, to address data scarcity, we construct and release a corresponding data pipeline and test set. Experiments show that AV-CoT effectively mitigates the…

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis · Music and Audio Processing