ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning
Ziqiang Xu, Qi Dai, Tian Xie, Yifan Yang, Kai Qiu, DongDong Chen, Zuxuan Wu, Chong Luo

TL;DR
ViaRL introduces a reinforcement learning framework with iterative amplification for adaptive, intention-driven video frame selection, improving temporal grounding without costly annotations.
Contribution
It is the first to apply rule-based RL with iterative training cycles for video understanding, enhancing scalability and performance.
Findings
Achieves a nearly 15% improvement on the Needle QA benchmark.
Consistently outperforms previous methods across multiple datasets.
Demonstrates robust generalization in diverse video tasks.
Abstract
Video understanding is inherently intention-driven: humans naturally focus on relevant frames based on their goals. Recent advancements in multimodal large language models (MLLMs) have enabled flexible query-driven reasoning; however, video-based frameworks like Video Chain-of-Thought lack direct training signals to effectively identify relevant frames. Current approaches often rely on heuristic methods or pseudo-label supervised annotations, which are both costly and limited in scalability across diverse scenarios. To overcome these challenges, we introduce ViaRL, the first framework to leverage rule-based reinforcement learning (RL) for optimizing frame selection in intention-driven video understanding. An iterated amplification strategy is adopted to perform alternating cyclic training in the video CoT system, where each component undergoes iterative cycles of refinement to improve…
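The alternating cyclic training described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the softmax selector over frame indices, the stub answer model summarized by a single `skill` number, the 0/1 correctness reward, the group-mean baseline, and all names and hyperparameters (`selector_step`, `answer_step`, `train`, `needle`) are assumptions made for illustration only.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rule_reward(selected, needle, skill, rng):
    """Rule-based reward: 1.0 if the stub answer model is correct, else 0.0.

    Assumption: the stub is correct with probability `skill` when the needle
    frame was selected, and at 4-way chance (0.25) otherwise.
    """
    p_correct = skill if needle in selected else 0.25
    return 1.0 if rng.random() < p_correct else 0.0

def selector_step(logits, needle, skill, rng, k=2, group=16, lr=0.5):
    """One policy-gradient update of the selector; the answer model is frozen."""
    probs = softmax(logits)
    n = len(logits)
    # Each rollout draws k frame indices (with replacement) from the policy.
    rollouts = [rng.choices(range(n), weights=probs, k=k) for _ in range(group)]
    rewards = [rule_reward(sel, needle, skill, rng) for sel in rollouts]
    baseline = sum(rewards) / group  # group-mean baseline (a simplification)
    new_logits = list(logits)
    for sel, r in zip(rollouts, rewards):
        adv = r - baseline
        for j in range(n):
            grad = sel.count(j) - k * probs[j]  # d log pi(sel) / d logit_j
            new_logits[j] += lr * adv * grad / group
    return new_logits

def answer_step(logits, needle, skill, rng, k=2, group=16, lr=0.05):
    """One toy update of the answer model; the selector is frozen.

    Assumption: the stub only improves on rollouts where the needle is visible.
    """
    probs = softmax(logits)
    n = len(logits)
    rollouts = [rng.choices(range(n), weights=probs, k=k) for _ in range(group)]
    visible = sum(needle in sel for sel in rollouts) / group
    return skill + lr * (1.0 - skill) * visible

def train(n_frames=8, needle=5, cycles=3, steps=50, seed=0):
    """Alternate selector and answer-model phases for a few cycles."""
    rng = random.Random(seed)
    logits = [0.0] * n_frames  # uniform selector at the start
    skill = 0.6                # initial stub answer-model accuracy
    for _ in range(cycles):
        for _ in range(steps):
            logits = selector_step(logits, needle, skill, rng)
        for _ in range(steps):
            skill = answer_step(logits, needle, skill, rng)
    return softmax(logits), skill
```

Calling `train()` returns the selector's final distribution over frames and the stub's final `skill`. In this toy setup the two phases reinforce each other: a better selector exposes the needle frame more often during the answer phase, and a more accurate answer model gives the selector a cleaner reward signal.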
Peer Reviews
Decision: Submitted to ICLR 2026
1. This paper provides an RL framework for temporal grounding without human annotation. By iteratively refining the selector and the answer model on the training set, it improves the performance of Qwen.
2. The paper also provides useful tricks, for example, borrowing an idea from existing work to mark the frame index in the corner of each frame.
3. The authors achieve strong performance with the Qwen model.
1. The RL method is only tested on Qwen-2.5-VL, so it is hard to know whether it generalizes to other models.
2. In the data preparation process, CLIP is used to sample frames relevant to the question, which could be inaccurate. CLIP measures the semantic similarity between the frame and the answer, while a frame may be barely connected to the question in isolation yet important in the context of surrounding frames. I also do not see any experiments supporting this sampling process.
3. Tabl
- The paper is well-written and easy to follow.
- The training details are explained thoroughly, which improves the reproducibility of the paper.
- Unclear Inference Cost: The paper motivates its approach by citing the high cost of processing all frames. However, the proposed ViaRL framework requires two sequential MLLM forward passes at inference time.
- From my perspective, the paper lacks technical novelty. Compared to existing GRPO-based works, the difference is the introduction of frame selection before question answering. However, multiple existing works already solve question-answering tasks with frame selection.
The paper presents a novel framework, ViaRL, which leverages rule-based reinforcement learning to optimize frame selection in video understanding tasks. Central to the approach is the Visual Iterated Amplification training strategy, an innovative iterative refinement process that alternates between optimizing the frame selector and the answer model, providing strong motivation and technical soundness. The effectiveness of ViaRL is demonstrated through comprehensive experiments on several challen
The major weaknesses of this work are the following:
1. It would be nice to see how the method affects other VLMs that are not flexible in the resolution/quality of input data.
2. While the method performs really well on one of the most challenging problems of video understanding, it lacks comparison on more generic tasks like answer generation (for Q&A and captioning, for example) to show the impact of this 'specialization' on other capabilities of the network.
3. The ablation on cyclic tr
Taxonomy
Topics: Speech and Audio Processing · Cognitive Science and Education Research · Video Analysis and Summarization
Methods: Focus
