MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios

Yang Shi; Huanqian Wang; Wulin Xie; Huanyao Zhang; Lijie Zhao; Yi-Fan Zhang; Xinfeng Li; Chaoyou Fu; Zhuoer Wen; Wenting Liu; Zhuoran Zhang; Xinlong Chen; Bohan Zeng; Sihan Yang; Yushuo Guan; Zhang Zhang; Liang Wang; Haoxuan Li; Zhouchen Lin; Yuanxing Zhang; Pengfei Wan; Haotian Wang; Wenjing Yang

arXiv:2505.21333 · cs.CV · September 26, 2025

Open Access · 1 dataset

TL;DR

This paper introduces the MME-VideoOCR benchmark to evaluate multimodal large language models' ability to perform OCR and understanding tasks in videos, revealing current models' limitations in complex, dynamic scenarios.

Contribution

The paper presents a comprehensive video OCR benchmark with diverse tasks and scenarios, and evaluates state-of-the-art models, highlighting their strengths and weaknesses in video comprehension.

Findings

01 The best model achieves 73.7% accuracy on the benchmark.

02 Models perform well on single-frame text recognition but struggle with holistic video understanding.

03 High-resolution input and temporal coverage are crucial for effective video OCR.
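
To make the point about temporal coverage concrete, here is a minimal sketch of picking frame indices spread evenly across a video, so that text appearing late in the clip is not missed. It is an illustration only and not the sampling procedure used in the paper; the function name `uniform_frame_indices` and its parameters are hypothetical.

```python
def uniform_frame_indices(num_frames, num_samples):
    """Pick num_samples frame indices spread evenly over num_frames frames.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if num_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= num_frames:
        # Fewer frames than requested samples: use every frame.
        return list(range(num_frames))
    # Split the video into num_samples equal segments and take each centre.
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


# A 300-frame clip sampled at 4 frames covers the start, middle, and end.
indices = uniform_frame_indices(300, 4)
print(indices)  # [37, 112, 187, 262]
```

Taking the first few frames instead would cover only the opening of the clip, which is one plausible way a model could miss text that appears later.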

Abstract

Multimodal Large Language Models (MLLMs) have achieved considerable accuracy in Optical Character Recognition (OCR) from static images. However, their efficacy in video OCR is significantly diminished due to factors such as motion blur, temporal variations, and visual effects inherent in video content. To provide clearer guidance for training practical MLLMs, we introduce the MME-VideoOCR benchmark, which encompasses a comprehensive range of video OCR application scenarios. MME-VideoOCR features 10 task categories comprising 25 individual tasks and spans 44 diverse scenarios. These tasks extend beyond text recognition to incorporate deeper comprehension and reasoning of textual content within videos. The benchmark consists of 1,464 videos with varying resolutions, aspect ratios, and durations, along with 2,000 meticulously curated, manually annotated question-answer pairs. We evaluate…
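
Since the benchmark groups its question-answer pairs into task categories, results are naturally reported both overall and per category. The sketch below shows one way such a tally could be computed. The record fields (`category`, `answer`, `prediction`), the category names, and the exact-match scoring rule are all assumptions made for illustration; they are not the dataset's schema or the paper's scoring protocol.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return (overall_accuracy, {category: accuracy}) for QA records.

    Each record is a dict with hypothetical keys 'category', 'answer',
    and 'prediction'. Scoring is a simple case-insensitive exact match,
    which is an assumption and may differ from the benchmark's own rules.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["category"]] += 1
        if r["prediction"].strip().lower() == r["answer"].strip().lower():
            correct[r["category"]] += 1
    per_category = {c: correct[c] / total[c] for c in total}
    n = sum(total.values())
    overall = sum(correct.values()) / n if n else 0.0
    return overall, per_category


# Toy records with made-up category names and answers.
records = [
    {"category": "text_recognition", "answer": "EXIT", "prediction": "exit"},
    {"category": "text_recognition", "answer": "Gate 12", "prediction": "Gate 21"},
    {"category": "text_based_reasoning", "answer": "B", "prediction": "B "},
]
overall, per_category = accuracy_by_category(records)
print(round(overall, 3), per_category)
```

Reporting the per-category breakdown alongside the overall figure is what lets a reader see a gap such as strong single-frame recognition next to weaker whole-video reasoning.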

Code & Models

Datasets

DogNeverSleep/MME-VideoOCR_Dataset
dataset · 226 dl

Taxonomy

Topics: Natural Language Processing Techniques · Video Analysis and Summarization · Speech and dialogue systems