TimeSearch-R: Adaptive Temporal Search for Long-Form Video Understanding via Self-Verification Reinforcement Learning
Junwen Pan, Qizhe Zhang, Rui Zhang, Ming Lu, Xin Wan, Yuan Zhang, Chang Liu, Qi She

TL;DR
TimeSearch-R introduces an adaptive reinforcement learning approach for temporal search in long-form videos, integrating reasoning and verification to improve search completeness and understanding accuracy.
Contribution
It reformulates temporal search as interleaved reasoning with self-verification, enabling end-to-end optimization and improved search strategies in long-form video understanding.
Findings
Achieves state-of-the-art results on LongVideoBench, with a 4.1% improvement.
Significantly outperforms previous methods on Haystack-LVBench and Haystack-Ego4D.
Demonstrates effective integration of self-verification in reinforcement learning for video reasoning.
Abstract
Temporal search aims to identify a minimal set of relevant frames from tens of thousands based on a given query, serving as a foundation for accurate long-form video understanding. Existing works attempt to progressively narrow the search space. However, these approaches typically rely on a hand-crafted search process, lacking end-to-end optimization for learning optimal search strategies. In this paper, we propose TimeSearch-R, which reformulates temporal search as interleaved text-video thinking, seamlessly integrating searching video clips into the reasoning process through reinforcement learning (RL). However, applying RL training methods, such as Group Relative Policy Optimization (GRPO), to video reasoning can result in unsupervised intermediate search decisions. This leads to insufficient exploration of the video content and inconsistent logical reasoning. To address these…
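As a rough illustration of why GRPO alone leaves intermediate search decisions unsupervised, the sketch below computes the group-relative advantage that GRPO is generally described as using: each rollout's final reward is normalized against the other rollouts sampled for the same prompt. The function name, the `eps` constant, the choice of population standard deviation, and the example rewards are assumptions made for illustration; none of them is taken from the paper or its code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group.

    `rewards` holds one scalar outcome reward per rollout, all sampled
    for the same (video, question) prompt. `eps` is an assumed small
    constant that avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical outcome rewards for four rollouts of one question
# (1.0 = final answer correct, 0.0 = incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, every token in a rollout shares the single advantage derived from its final reward, so a rollout that searched the wrong clips but guessed the right answer is credited the same as one that searched thoroughly. That is the gap in intermediate supervision the abstract points to.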
Peer Reviews
Decision · ICLR 2026 Poster
* The paper is well written and easy to understand.
* The paper proposes a novel algorithm for frame search that does not rely on human annotations and leverages “thinking” with video frames and text.
* The authors have a fairly good set of eval benchmarks: VideoMME, MLVU, and LVB. To better understand when the authors’ work should be used, it would be interesting to see the cases where the base model performs better, or any potential failure modes of the proposed method.
* The code, models, a
* Performance is not substantially better than existing methods, often showing only moderate gains. However, there are consistent gains on the downstream Video QA task, which indicates the approach can work broadly.
* This approach requires finetuning the base VLM via RL, whereas other approaches (e.g., T*) do not require any finetuning and can thus leverage API endpoints.
* The novelty of the method may be somewhat limited in the sense that it applies a known technique, GRPO, to
Long-form video understanding and reasoning is an important task on which models perform poorly. This approach aims to move the needle and leverages a nice balance of search and "reasoning" (as CoT is usually done) to integrate the video content where appropriate to validate a hypothesis (the question). The approach appears to work well and is intuitive.
1. See questions below.
2. I fear that I'm missing some intuition about where to go next after this paper. I'm trying to use the examples in the appendix (Fig 16/17) to reason about where and why things fail, and whether anything can be done about them. The authors' guidance on the importance of specific parameters (e.g. 8) would be appreciated, as would guidance on where the approach is not appropriate. For example, in the insufficient search, is the data well characterized such that we know how well an
1. Clear motivation and problem formulation. The paper correctly identifies the bottleneck of current Video-LLMs and reformulates temporal search as an interactive reasoning process.
2. The GRPO-CSV algorithm introduces self-verification to provide intermediate supervision within reinforcement learning.
3. The proposed model shows consistent gains across several long-video understanding benchmarks and produces explicit reasoning–search traces, enhancing interpretability over prior approaches.
The main concern lies in model coupling and generalization. It remains unclear whether the learned TimeSearch-R policy is specific to the base VLM (Qwen2.5-VL-7B) or transferable to other model sizes and architectures. Since the reinforcement training is done with a fixed backbone, it is possible that the learned policy overfits to the internal embedding or feature distribution of that model. This would limit the practical usefulness of TimeSearch-R as a general temporal search module for other
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
