TL;DR
VidText introduces a comprehensive benchmark for evaluating video text understanding, addressing the gap in existing video and OCR benchmarks by covering diverse scenarios, multilingual content, and multi-level reasoning tasks.
Contribution
The paper presents VidText, a new benchmark with hierarchical evaluation and paired perception tasks, enabling thorough assessment of video text understanding and multimodal reasoning capabilities.
Findings
Current models perform poorly on most tasks
Input resolution and OCR capabilities significantly affect performance
External reasoning strategies improve model understanding
Abstract
Visual texts embedded in videos carry rich semantic information, which is crucial for both holistic video understanding and fine-grained reasoning about local human actions. However, existing video understanding benchmarks largely overlook textual information, while OCR-specific benchmarks are constrained to static images, limiting their ability to capture the interaction between text and dynamic visual contexts. To address this gap, we propose VidText, a new benchmark designed for comprehensive and in-depth evaluation of video text understanding. VidText offers the following key features: 1) It covers a wide range of real-world scenarios and supports multilingual content, encompassing diverse settings where video text naturally appears. 2) It introduces a hierarchical evaluation framework with video-level, clip-level, and instance-level tasks, enabling assessment of both global…
Peer Reviews
Decision: Submitted to ICLR 2026
- Fills a Critical Research Gap: The paper convincingly argues for and fills an important, underexplored niche in multimodal evaluation. Understanding text embedded in dynamic scenes is crucial for holistic video comprehension, and VidText is the first benchmark to address this systematically.
- Comprehensive and Well-Designed Benchmark: The benchmark's design is a major strength. The multi-granularity structure (instance, clip, video) tests a wide range of capabilities, and the paired perceptio
- Reliance on Multiple-Choice for Reasoning: For standardization, most reasoning tasks are formulated as multiple-choice questions. This is a practical choice but may not fully capture the nuanced reasoning failures or generative capabilities of models. An analysis of open-ended responses, even on a small subset, could provide complementary insights.
- Limited Evaluation of Recent SOTA Models: The benchmark omits newer iterations like Gemini 2.5 Pro and GPT-5. Since these latest models may have
1. The proposed benchmark is human-annotated with a double human check. I believe the manually labeled benchmark benefits the community and could positively guide the development of MLLMs.
2. The OCR-related tasks are interesting.
3. The authors evaluate several frontier methods to show their capabilities in tackling video understanding tasks.
1. I do not think the proposed benchmark evaluates any new aspect in comparison to existing video understanding benchmarks. The proposed benchmark includes 8 question types, all of which are already covered by existing video understanding benchmarks such as Video-MME and MVBench. The average video length is 108.2 seconds, so it is not a long video understanding benchmark. Though the authors claim that the proposed benchmark supports open-ended evaluation, the so-called open-ended protocol can only
* **Comprehensive Benchmark Design:** The hierarchical (video/clip/instance) and paired (perception/reasoning) task structure is a major strength, enabling a much more nuanced evaluation than previous benchmarks.
* **Extensive Empirical Analysis:** The evaluation of 18 models provides a valuable snapshot of the field's capabilities and limitations. The ablation studies effectively validate the benchmark's design choices.
* **Outdated Model Comparison:** A significant weakness is the omission of the very latest flagship models (e.g., Gemini 2.5 Pro, GLM-4V). This quickly diminishes the paper's relevance and the persuasiveness of its conclusions about the current state-of-the-art.
* **Limited Discussion on Data Biases:** While the dataset is diverse, there is no discussion of potential biases in the video sources (e.g., geographic or cultural biases from YouTube) or the annotation process that might affect mod
Taxonomy
Methods: Sparse Evolutionary Training
