TextVidBench: A Benchmark for Long Video Scene Text Understanding

Yangyang Zhong; Ji Qi; Yuan Yao; Pengxin Luo; Yunfeng Yan; Donglian Qi; Zhiyuan Liu; Tat-Seng Chua

arXiv:2506.04983·cs.CV·June 6, 2025

Open Access

TL;DR

TextVidBench is a benchmark for evaluating scene text understanding in long videos (over three minutes). It addresses the limited duration and narrow evaluation scope of earlier datasets by covering diverse domains and providing detailed annotations, and the paper also proposes methods to improve model performance.

Contribution

The paper introduces TextVidBench, the first long-video scene text understanding benchmark with diverse domain coverage, a three-stage evaluation framework, and high-quality annotations, along with novel methods to improve model performance.

Findings

01. Existing models struggle with long-video scene text understanding.

02. The proposed methods improve temporal perception in large models.

03. TextVidBench presents significant challenges for current models.

Abstract

Despite recent progress on the short-video Text-Visual Question Answering (ViteVQA) task - largely driven by benchmarks such as M4-ViteVQA - existing datasets still suffer from limited video duration and narrow evaluation scopes, making it difficult to adequately assess the growing capabilities of powerful multimodal large language models (MLLMs). To address these limitations, we introduce TextVidBench, the first benchmark specifically designed for long-video text question answering (>3 minutes). TextVidBench makes three key contributions: 1) Cross-domain long-video coverage: Spanning 9 categories (e.g., news, sports, gaming), with an average video length of 2306 seconds, enabling more realistic evaluation of long-video understanding. 2) A three-stage evaluation framework: "Text Needle-in-Haystack -> Temporal Grounding -> Text Dynamics Captioning". 3) High-quality fine-grained…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning