Zoom-Zero: Reinforced Coarse-to-Fine Video Understanding via Temporal Zoom-in
Xiaoqian Shen, Min-Hung Chen, Yu-Chiang Frank Wang, Mohamed Elhoseiny, Ryo Hachiuma

TL;DR
Zoom-Zero introduces a coarse-to-fine approach for grounded video question answering, enhancing temporal localization and answer accuracy by zooming into salient frames and validating grounding fidelity.
Contribution
It proposes a novel zoom-in accuracy reward and token-selective credit assignment to improve temporal grounding and visual verification in GVQA, addressing limitations of previous methods.
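To make the credit-assignment idea concrete, here is a minimal sketch of how a group-relative (GRPO-style) advantage could be routed to different token spans. The summary does not spell out the paper's exact rule, so everything here is an assumption for illustration: the function names, the 0/1 span masks, and the choice to normalize the grounding and answer rewards separately.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_selective_advantages(grounding_rewards, answer_rewards,
                               grounding_masks, answer_masks):
    """Hypothetical decoupled credit assignment (not the paper's exact rule).

    Each response has two per-token 0/1 masks: one marking the tokens that
    state the temporal segment, one marking the tokens that state the answer.
    The grounding advantage is applied only to grounding tokens and the
    answer advantage only to answer tokens; all other tokens receive zero.
    """
    adv_g = group_relative_advantages(grounding_rewards)
    adv_a = group_relative_advantages(answer_rewards)
    per_token = []
    for a_g, a_a, m_g, m_a in zip(adv_g, adv_a, grounding_masks, answer_masks):
        per_token.append([a_g * g + a_a * a for g, a in zip(m_g, m_a)])
    return per_token
```

In this sketch, two responses that answer equally well but ground differently receive a non-zero advantage only on their grounding tokens, which is the kind of separation a single sequence-level advantage cannot express.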
Findings
Improves temporal grounding accuracy by 5.2% on NExT-GQA and 4.6% on ReXTime.
Enhances answer accuracy by 2.4% on average.
Improves long-video understanding by 6.4% on average across long-video benchmarks.
Abstract
Grounded video question answering (GVQA) aims to localize relevant temporal segments in videos and generate accurate answers to a given question; however, large video-language models (LVLMs) exhibit limited temporal awareness. Although existing approaches based on Group Relative Policy Optimization (GRPO) attempt to improve temporal grounding, they still struggle to faithfully ground their answers in the relevant video evidence, leading to temporal mislocalization and hallucinations. In this work, we present Zoom-Zero, a coarse-to-fine framework that first localizes query-relevant segments and then temporally zooms into the most salient frames for finer-grained visual verification. Our method addresses the limits of GRPO for the GVQA task with two key innovations: (i) a zoom-in accuracy reward that validates the fidelity of temporal grounding prediction and facilitates fine-grained…
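The coarse-to-fine idea described in the abstract can be illustrated with a small sketch. It assumes segments are `(start, end)` pairs in seconds and uses uniform sampling inside the predicted segment as a stand-in for the paper's selection of salient frames, which this summary does not specify. The helper names are hypothetical, and temporal IoU is included only as the usual way to score a predicted segment against a ground-truth one.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def uniform_timestamps(start, end, num_frames):
    """Midpoints of num_frames equal-width bins over [start, end]."""
    step = (end - start) / num_frames
    return [start + step * (i + 0.5) for i in range(num_frames)]


def coarse_to_fine_timestamps(duration, segment, num_coarse, num_fine):
    """Return (coarse, fine) frame timestamps.

    coarse: sparse samples over the whole video, used for localization.
    fine:   denser samples restricted to the predicted segment (the zoom-in).
    """
    coarse = uniform_timestamps(0.0, duration, num_coarse)
    start = max(0.0, segment[0])          # clamp the segment to the video
    end = min(duration, segment[1])
    fine = uniform_timestamps(start, end, num_fine)
    return coarse, fine
```

For a 60-second video with a predicted segment of 20 to 30 seconds, four coarse frames are spaced 15 seconds apart while four fine frames are spaced 2.5 seconds apart, so the second pass sees the segment at six times the temporal resolution for the same frame budget.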
Peer Reviews
Decision · Submitted to ICLR 2026
1. The paper addresses an important and well-defined limitation in current GRPO-based LVLM reinforcement fine-tuning for video understanding — namely, the gap between answer correctness and temporal grounding fidelity. The authors articulate this motivation clearly and convincingly, showing that existing methods often fail to verify whether grounded frames truly contain the visual evidence supporting the answer. 2. The paper does not simply introduce a two-stage pipeline; it integrates it into a
1. The novelty of this paper is marginally below the acceptance borderline. While the proposed Zoom-Zero framework introduces a coarse-to-fine “temporal zoom-in” mechanism and two reinforcement learning enhancements (the zoom-in accuracy reward and token-selective credit assignment), these components represent incremental extensions rather than substantial methodological breakthroughs. 2. The overall design is mainly heuristic and system-oriented, building on existing GRPO-based RL fine-tuning f
1. **Strong Results.** The proposed model achieves state-of-the-art performance on grounded video QA and long video understanding benchmarks, clearly demonstrating its effectiveness. 2. **Clarity of Writing.** The paper is well-organized and clearly written, allowing readers to easily follow the technical details and rationale of the proposed approach. 3. **Motivation for Token-Level Advantage Estimation.** The paper insightfully identifies and addresses a key limitation in prior GRPO-based meth
1. **Novelty compared to VideoChat-R1.** The method largely resembles that of VideoChat-R1.
   1. How is the temporal grounding reward different from the IoU reward used in VideoChat-R1?
   2. Similarly, how does the zoom accuracy reward differ from the accuracy or recall reward in VideoChat-R1?
2. **Novelty compared to Qwen2.5-VL + frame selection methods.** Since Qwen2.5-VL inherently supports dynamic frame sampling, the zoom-in capability seems intrinsic to the base model rather than a nov
1. The two innovations, temporal zoom-in and decoupled credit assignment, sound reasonable and are easy to understand. 2. The presentation is well-structured and easy to read. 3. The experimental results are good (yet to be justified, see weaknesses).
1. The coarse-to-fine zoom-in strategy would severely increase the memory and computation burden during GRPO optimization. There is a lack of comparison and analysis on this limitation.
2. There are several serious problems in Table 1, making the comparison ineffective.
   - First, NExT-GQA only provides temporal labels for validation and test sets. It seems that the authors use the validation set for training and compare with those methods for zero-shot testing in Table 1 (all non-RL methods).
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
