VideoZoomer: Reinforcement-Learned Temporal Focusing for Long Video Reasoning

Yang Ding; Yizhen Zhang; Xin Lai; Ruihang Chu; Yujiu Yang

arXiv:2512.22315·cs.CV·December 30, 2025


Open Access · 3 Reviews

TL;DR

VideoZoomer introduces a dynamic, reinforcement-learned approach for long video reasoning, enabling models to focus adaptively on critical video segments, significantly improving understanding and reasoning performance over existing methods.

Contribution

The paper presents a novel agentic framework with reinforcement learning for adaptive visual focus in long video reasoning, surpassing prior static sampling techniques.

Findings

01

Outperforms existing open-source models on long video benchmarks.

02

Achieves superior reasoning with reduced frame budgets.

03

Demonstrates emergent complex reasoning capabilities.

Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable progress in vision-language tasks yet remain limited in long video understanding because of their restricted context window. Consequently, prevailing approaches tend to rely on uniform frame sampling or static pre-selection, which may overlook critical evidence and cannot correct an initial selection error during reasoning. To overcome these limitations, we propose VideoZoomer, a novel agentic framework that enables MLLMs to dynamically control their visual focus during reasoning. Starting from a coarse low-frame-rate overview, VideoZoomer invokes a temporal zoom tool to obtain high-frame-rate clips at autonomously chosen moments, thereby progressively gathering fine-grained evidence in a multi-turn interactive manner. Accordingly, we adopt a two-stage training strategy: a cold-start supervised fine-tuning phase…
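The overview-then-zoom idea in the abstract can be illustrated with a minimal sketch of timestamp selection. This is not the authors' implementation: the function names, the frame rates, and the `max_frames` cap are all hypothetical, and the sketch only computes which timestamps would be decoded, leaving frame extraction and the model's choice of zoom window aside.

```python
from typing import List


def overview_timestamps(duration_s: float, fps: float) -> List[float]:
    """Uniformly spaced timestamps covering the whole video at a low frame rate.

    Hypothetical helper: stands in for the coarse overview pass.
    """
    if duration_s <= 0 or fps <= 0:
        return []
    n = int(duration_s * fps)
    step = 1.0 / fps
    return [round(i * step, 3) for i in range(n)]


def zoom_timestamps(
    start_s: float,
    end_s: float,
    fps: float,
    duration_s: float,
    max_frames: int,
) -> List[float]:
    """Densely spaced timestamps inside a requested window.

    Hypothetical helper: stands in for a temporal zoom tool call. The window
    is clamped to the video bounds, and the frame count is capped by
    max_frames (an assumed per-call budget).
    """
    start = max(0.0, start_s)
    end = min(duration_s, end_s)
    if end <= start or fps <= 0 or max_frames <= 0:
        return []
    n = min(max_frames, max(1, int((end - start) * fps)))
    step = (end - start) / n
    return [round(start + i * step, 3) for i in range(n)]


# A 60 s video: coarse pass at 0.5 fps, then a zoom on seconds 10 to 14 at 4 fps.
coarse = overview_timestamps(60.0, 0.5)
fine = zoom_timestamps(10.0, 14.0, 4.0, 60.0, max_frames=64)
```

In a multi-turn setting, the model would see the coarse frames first and then request one or more zoom windows; how those windows are chosen is what the paper trains with reinforcement learning, and that part is not modeled here.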

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

1) This paper is among the first to explore frame selection via agentic RL in the video LLM domain, delivering substantial novelty. Also, the agentic RL framework is not simply carried over from the image/language domain to the video domain. Instead, it incorporates tool calling and methods designed specifically for videos (e.g., temporal zoom-in, on-policy reflection).
2) The resulting agentic model, trained with GRPO, outperforms the baseline model by a large margin, especially on the l

Weaknesses

Overall, this paper has great technical value and soundness, but there are some minor concerns, listed below:
1) There are several minor typos across the text, including line 151: "stragety", line 277: "fotmat", and more. The authors should perform a grammar and spelling check throughout the paper.
2) Are other capabilities of video LLMs (Qwen 2.5-VL) well maintained, such as short video captioning?
3) Qwen 2.5-VL is known to lack native <think></think> reasoning capabilities. The authors performed off

Reviewer 02 · Rating 4 · Confidence 5

Strengths

- The paper contributes a training dataset comprising 11,000 trajectories, which is used to enhance the tool-calling capabilities of models.
- The case visualizations presented in the paper are good.

Weaknesses

- The technical contributions of the paper are limited, as its main novelty lies in providing a curated training dataset to enhance the tool-calling capabilities of models.
- The method adds a bonus to tool-call rewards when the final answer is correct, which can encourage unnecessary tool use. For example, the model may keep calling tools and retrieving irrelevant clips even after it already has the correct answer, ultimately inflating the reward.
- While the paper
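The reward concern raised above can be made concrete with a small sketch. It is an illustration only: the excerpt does not say whether the paper's bonus is flat or scales with the number of calls, so the function name, the bonus value, and the `per_call` and `max_rewarded_calls` parameters are all hypothetical.

```python
def trajectory_reward(
    answer_correct: bool,
    num_tool_calls: int,
    tool_bonus: float = 0.5,
    per_call: bool = False,
    max_rewarded_calls: int = 1,
) -> float:
    """Accuracy reward plus a tool-use bonus granted only on correct answers.

    Hypothetical shaping, not the paper's exact formula. With per_call=True
    every tool call earns the bonus, which is the case the reviewer worries
    about. With per_call=False the bonus is capped at max_rewarded_calls.
    """
    reward = 1.0 if answer_correct else 0.0
    if answer_correct and num_tool_calls > 0:
        if per_call:
            rewarded = num_tool_calls
        else:
            rewarded = min(num_tool_calls, max_rewarded_calls)
        reward += tool_bonus * rewarded
    return reward


# Three tool calls on a correct answer, under both assumed variants.
capped = trajectory_reward(True, 3)
uncapped = trajectory_reward(True, 3, per_call=True)
```

Under the capped variant, extra calls after the first add nothing to the reward, while the uncapped variant grows with every call. Which of these matches the paper would need to be checked against its reward definition.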

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. This paper introduces a clear agentic framework that couples coarse “glance” perception with targeted temporal zooming, yielding a principled separation between broad coverage and fine-grained evidence acquisition.
2. This paper evaluates across diverse long-video understanding and reasoning benchmarks, with the largest gains on tasks that require precise temporal detail, supporting the method’s intended use case.

Weaknesses

1. The zoom tool is basically one-dimensional: many long-video questions hinge on tiny textual clues (scoreboards, signs, on-screen text), and just “adding frames” won’t reliably capture those.
2. The cold-start data comes from external frontier models (e.g., GPT-style teachers), which may introduce style bias.
3. Multi-round zooming can be expensive in practice.

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis