Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination

Yolo Y. Tang; Daiki Shimada; Hang Hua; Chao Huang; Jing Bi; Rogerio Feris; Chenliang Xu

arXiv:2511.17490 · cs.CV · November 27, 2025

TL;DR

Video-R4 introduces iterative visual rumination for text-rich video reasoning: the model re-inspects frames and zooms into critical regions instead of relying on a single pass, which improves accuracy on text-rich video QA and carries over to document, slide, and generic video QA.

Contribution

The paper presents a multi-stage training framework for a 7B large multimodal model (LMM) that performs iterative visual rumination, strengthening its pixel-grounded multimodal reasoning.

Findings

1. Achieves state-of-the-art results on M4-ViteVQA.

2. Generalizes well to document, slide, and generic video QA.

3. Demonstrates the effectiveness of iterative rumination for fine-grained reasoning.

Abstract

Understanding text-rich videos requires reading small, transient textual cues that often demand repeated inspection. Yet most video QA models rely on single-pass perception over fixed frames, leading to hallucinations and failures on fine-grained evidence. Inspired by how humans pause, zoom, and re-read critical regions, we introduce Video-R4 (Reinforcing Text-Rich Video Reasoning with Visual Rumination), a video reasoning LMM that performs visual rumination: iteratively selecting frames, zooming into informative regions, re-encoding retrieved pixels, and updating its reasoning state. We construct two datasets with executable rumination trajectories: Video-R4-CoT-17k for supervised practice and Video-R4-RL-30k for reinforcement learning. We propose a multi-stage rumination learning framework that progressively finetunes a 7B LMM to learn atomic and mixing visual operations via SFT and…
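The rumination loop described in the abstract (select a frame, zoom into a region, re-encode the retrieved pixels, update the reasoning state) can be illustrated with a toy sketch. This is not the authors' implementation: the names `Action`, `State`, `crop`, `ruminate`, and `toy_policy` are hypothetical, frames are plain nested lists rather than real images, and appending pixels to an evidence list stands in for re-encoding them with a vision encoder.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Toy grayscale frame: a list of rows, each row a list of pixel values.
Frame = List[List[int]]


@dataclass
class Action:
    """Hypothetical action format; the paper's real action schema may differ."""
    kind: str                                       # "select", "zoom", or "answer"
    frame_idx: int = 0
    box: Tuple[int, int, int, int] = (0, 0, 0, 0)   # (top, left, bottom, right)
    answer: str = ""


@dataclass
class State:
    """Reasoning state: the question plus all pixels gathered so far."""
    question: str
    evidence: List[Frame] = field(default_factory=list)


def crop(frame: Frame, box: Tuple[int, int, int, int]) -> Frame:
    """Return the sub-region of a frame (bottom and right are exclusive)."""
    top, left, bottom, right = box
    return [row[left:right] for row in frame[top:bottom]]


def ruminate(video: List[Frame], question: str,
             policy: Callable[[State], Action], max_steps: int = 4):
    """Repeatedly ask the policy for an action until it answers or steps run out."""
    state = State(question)
    for _ in range(max_steps):
        action = policy(state)
        if action.kind == "answer":
            return action.answer, state
        frame = video[action.frame_idx]
        if action.kind == "select":
            pixels = frame                      # inspect the whole frame
        elif action.kind == "zoom":
            pixels = crop(frame, action.box)    # inspect one region only
        else:
            raise ValueError(f"unknown action kind: {action.kind}")
        # Assumption: appending to the evidence list stands in for
        # re-encoding the pixels and feeding them back to the model.
        state.evidence.append(pixels)
    return "", state                            # no answer within the step budget


def toy_policy(state: State) -> Action:
    """Scripted stand-in for the model: select a frame, zoom, then answer."""
    step = len(state.evidence)
    if step == 0:
        return Action("select", frame_idx=1)
    if step == 1:
        return Action("zoom", frame_idx=1, box=(0, 1, 2, 3))
    return Action("answer", answer="done")


# Two 3x4 frames with distinct pixel values so the crop is easy to check.
video = [[[r * 4 + c + 100 * t for c in range(4)] for r in range(3)]
         for t in range(2)]
answer, state = ruminate(video, "What does the sign say?", toy_policy)
```

In this sketch the policy is scripted; in a trained system the policy would be the model itself, choosing each action from the question and the evidence gathered so far.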

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Explainable Artificial Intelligence (XAI)