TL;DR
VideoDetective introduces a novel framework for long video understanding that combines query relevance and intrinsic video structure to improve clue localization in question answering tasks.
Contribution
It integrates query-to-segment relevance with inter-segment affinity using a visual-temporal graph and a hypothesis-verification-refinement loop, enhancing accuracy in long-video comprehension.
Findings
Achieves up to 7.5% accuracy improvement on VideoMME-long benchmark.
Effectively localizes critical video segments for question answering.
Demonstrates consistent gains across multiple MLLMs and benchmarks.
Abstract
Long video understanding remains challenging for multimodal large language models (MLLMs) due to limited context windows, which necessitate identifying sparse query-relevant video segments. However, existing methods predominantly localize clues based solely on the query, overlooking the video's intrinsic structure and varying relevance across segments. To address this, we propose VideoDetective, a framework that integrates query-to-segment relevance and inter-segment affinity for effective clue hunting in long-video question answering. Specifically, we divide a video into various segments and represent them as a visual-temporal affinity graph built from visual similarity and temporal proximity. We then perform a Hypothesis-Verification-Refinement loop to estimate relevance scores of observed segments to the query and propagate them to unseen segments, yielding a global relevance…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
