OCR-Enhanced Multimodal ASR Can Read While Listening

Junli Chen; Changli Tang; Yixuan Li; Guangzhi Sun; Chao Zhang

arXiv:2601.18393·cs.SD·January 27, 2026

TL;DR

This paper introduces Donut-Whisper, a multimodal ASR model that leverages visual subtitles and audio to improve speech recognition in English and Chinese, demonstrating significant performance gains over baselines.

Contribution

It proposes a novel audio-visual ASR model with dual encoders, a cross-attention module for modality alignment, and a lightweight knowledge distillation scheme, along with a new multilingual dataset.
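To make the idea of cross-attention-based modality alignment concrete, here is a minimal sketch of single-head scaled dot-product cross-attention in plain Python. It is not the Donut-Whisper module: the choice of audio frames as queries and subtitle (visual) features as keys and values, the absence of learned projections, and the names `cross_attention`, `audio_frames`, and `subtitle_feats` are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: list of d-dim vectors (assumed here: audio frames)
    keys, values: lists of vectors (assumed here: visual features)
    Returns one fused vector per query.
    """
    d = len(queries[0])
    fused = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        fused.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return fused


# Toy inputs (hypothetical): two audio frames, two subtitle feature vectors.
audio_frames = [[1.0, 0.0], [0.0, 1.0]]
subtitle_feats = [[10.0, 0.0], [0.0, 10.0]]
fused = cross_attention(audio_frames, subtitle_feats, subtitle_feats)
```

Each audio frame ends up dominated by the subtitle feature it matches most closely, which is the behaviour a fusion module of this kind relies on. A real model would add learned query, key, and value projections and multiple heads.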

Findings

01 Achieved 5.75% WER reduction on English

02 Achieved 16.5% CER reduction on Chinese

03 Demonstrated superior performance over baseline models
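For readers unfamiliar with the two metrics, the sketch below computes word error rate (WER) and character error rate (CER) as Levenshtein edit distance divided by reference length. The function names are hypothetical, whitespace handling is a simplifying assumption, and this page does not state whether the reported reductions are absolute or relative, so the code only illustrates how a single score is obtained.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitution, insertion, deletion)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance, spaces ignored."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, list(hypothesis.replace(" ", ""))) / len(ref_chars)


english_wer = wer("the cat sat on the mat", "the cat sat on mat")
chinese_cer = cer("你好世界", "你好世")
```

CER is the usual choice for Chinese because the text has no whitespace-delimited words, which is why the two findings above use different metrics.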

Abstract

Visual information, such as subtitles in a movie, often helps automatic speech recognition. In this paper, we propose Donut-Whisper, an audio-visual ASR model with dual encoders that leverages visual information to improve speech recognition performance in both English and Chinese. Donut-Whisper combines the advantages of the linear and Q-Former-based modality alignment structures via a cross-attention module, generating more powerful audio-visual features. We also propose a lightweight knowledge distillation scheme that showcases the potential of using audio-visual models to teach audio-only models to achieve better performance. Moreover, we propose a new multilingual audio-visual speech recognition dataset based on movie clips, containing both Chinese and English partitions. As a result, Donut-Whisper achieved significantly better performance on both the English and Chinese partitions of…
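The abstract does not describe how its distillation scheme works, so the sketch below swaps in the standard soft-label formulation purely for illustration: a student (here imagined as an audio-only model) is trained to match the temperature-softened output distribution of a teacher (imagined as the audio-visual model) through a KL-divergence loss. The function names, the temperature value, and the toy logits are all assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Scaled by temperature squared, as is conventional, so gradient
    magnitudes stay comparable across temperatures.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0  # terms with zero teacher probability contribute nothing
    )
    return kl * temperature ** 2


# Toy logits over a three-token vocabulary (hypothetical values).
teacher = [1.0, 2.0, 3.0]
matched_loss = distillation_loss([1.0, 2.0, 3.0], teacher)
mismatched_loss = distillation_loss([3.0, 2.0, 1.0], teacher)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge. In practice it would typically be added to the ordinary transcription loss with a weighting factor.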

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Face recognition and analysis