DIVE: Deep-search Iterative Video Exploration A Technical Report for the CVRR Challenge at CVPR 2025

Umihiro Kamoto; Tatsuya Ishibashi; Noriyuki Kugo

arXiv:2506.21891·cs.CV·June 30, 2025

TL;DR

This paper introduces DIVE, an iterative reasoning framework for video question answering, which achieved first place in the CVRR Challenge 2025 by accurately answering complex questions about diverse videos.

Contribution

The paper presents a novel iterative reasoning approach for video question answering, demonstrating superior accuracy on the CVRR-ES benchmark compared to prior methods.

Findings

01. Achieved 81.44% accuracy on the CVRR-ES test set.

02. Outperformed all other participants in the CVRR Challenge 2025.

03. Validated the effectiveness of iterative reasoning in complex video QA.

Abstract

In this report, we present the winning solution, which achieved 1st place in the Complex Video Reasoning & Robustness Evaluation Challenge 2025. This challenge evaluates the ability to generate accurate natural language answers to questions about diverse, real-world video clips. It uses the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES) benchmark, which consists of 214 unique videos and 2,400 question-answer pairs spanning 11 categories. Our method, DIVE (Deep-search Iterative Video Exploration), adopts an iterative reasoning approach in which each input question is semantically decomposed and then solved through stepwise reasoning and progressive inference. This enables our system to provide highly accurate and contextually appropriate answers to even the most complex queries. Applied to the CVRR-ES benchmark, our approach achieves 81.44% accuracy on the test set,…
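The decompose-then-solve loop described in the abstract can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function `iterative_answer`, the `ReasoningTrace` container, the three callables (`decompose`, `explore`, `synthesize`), and the `max_steps` cap are all hypothetical names introduced here, and the toy stubs stand in for whatever video and language models a real system would call.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Step = Tuple[str, str]  # (sub-question, finding)


@dataclass
class ReasoningTrace:
    """Hypothetical record of the intermediate steps for one question."""
    question: str
    steps: List[Step] = field(default_factory=list)


def iterative_answer(
    question: str,
    decompose: Callable[[str], List[str]],
    explore: Callable[[str, List[Step]], str],
    synthesize: Callable[[str, List[Step]], str],
    max_steps: int = 8,  # assumed safety cap, not taken from the report
) -> Tuple[str, ReasoningTrace]:
    """Split a question into sub-questions, resolve them one at a time
    (each step can see earlier findings), then compose a final answer."""
    trace = ReasoningTrace(question)
    for sub_question in decompose(question)[:max_steps]:
        finding = explore(sub_question, trace.steps)
        trace.steps.append((sub_question, finding))
    return synthesize(question, trace.steps), trace


# Toy stand-ins so the sketch runs without any model or video.
def toy_decompose(question: str) -> List[str]:
    return ["Who appears in the clip?", "What do they do?"]


def toy_explore(sub_question: str, history: List[Step]) -> str:
    return f"finding {len(history) + 1}"


def toy_synthesize(question: str, steps: List[Step]) -> str:
    return "; ".join(finding for _, finding in steps)


answer, trace = iterative_answer(
    "What happens in the video?", toy_decompose, toy_explore, toy_synthesize
)
print(answer)
```

Passing the accumulated steps into `explore` is one simple way to express "progressive inference": later sub-questions can build on earlier findings. How DIVE actually shares context between steps is not stated in the text shown here.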

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques