VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding

Ruoliu Yang; Chu Wu; Caifeng Shan; Ran He; Chaoyou Fu

arXiv:2603.22285·cs.CV·May 4, 2026

TL;DR

VideoDetective introduces a novel framework for long video understanding that combines query relevance and intrinsic video structure to improve clue localization in question answering tasks.

Contribution

VideoDetective integrates query-to-segment relevance with inter-segment affinity, using a visual-temporal affinity graph and a Hypothesis-Verification-Refinement loop to improve accuracy in long-video question answering.

Findings

01. Achieves up to a 7.5% accuracy improvement on the VideoMME-long benchmark.

02. Effectively localizes the video segments that are critical for question answering.

03. Demonstrates consistent gains across multiple MLLMs and benchmarks.

Abstract

Long video understanding remains challenging for multimodal large language models (MLLMs) due to limited context windows, which necessitate identifying sparse query-relevant video segments. However, existing methods predominantly localize clues based solely on the query, overlooking the video's intrinsic structure and varying relevance across segments. To address this, we propose VideoDetective, a framework that integrates query-to-segment relevance and inter-segment affinity for effective clue hunting in long-video question answering. Specifically, we divide a video into various segments and represent them as a visual-temporal affinity graph built from visual similarity and temporal proximity. We then perform a Hypothesis-Verification-Refinement loop to estimate relevance scores of observed segments to the query and propagate them to unseen segments, yielding a global relevance…
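
The abstract is truncated here and does not state how the graph is weighted or how scores are propagated, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes cosine similarity between segment features, a Gaussian kernel over segment index for temporal proximity, and a simple label-propagation-style update standing in for the propagation step. The names `build_affinity`, `propagate`, `sigma_t`, and `alpha` are hypothetical and do not come from the paper.

```python
import numpy as np


def build_affinity(features, sigma_t=2.0):
    """Affinity between segments: cosine visual similarity times a
    Gaussian kernel over segment index (an assumed form of temporal proximity)."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)
    visual = np.clip(unit @ unit.T, 0.0, None)  # keep non-negative similarities
    idx = np.arange(len(features))
    gap = idx[:, None] - idx[None, :]
    temporal = np.exp(-(gap ** 2) / (2.0 * sigma_t ** 2))
    affinity = visual * temporal
    np.fill_diagonal(affinity, 0.0)  # no self-loops
    return affinity


def propagate(affinity, observed, alpha=0.8, iters=50):
    """Spread relevance scores from observed segments to unseen ones.

    observed: dict mapping segment index -> relevance score in [0, 1].
    The update rule is an assumption (label-propagation style).
    """
    n = affinity.shape[0]
    row_sums = affinity.sum(axis=1, keepdims=True)
    # Row-normalize; rows with no neighbors stay all-zero.
    transition = np.divide(
        affinity, row_sums, out=np.zeros_like(affinity), where=row_sums > 0
    )
    seed = np.zeros(n)
    for i, score in observed.items():
        seed[i] = score
    scores = seed.copy()
    for _ in range(iters):
        scores = alpha * (transition @ scores) + (1.0 - alpha) * seed
        for i, score in observed.items():
            scores[i] = score  # observed segments keep their verified score
    return scores
```

Under these assumptions, the segments to pass to the MLLM could then be chosen as the top-scoring entries, for example with `np.argsort(scores)[::-1][:k]`. Clamping observed segments after each update keeps their verified scores fixed while the unseen segments inherit relevance from visually similar, temporally close neighbors.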
