Think, Then Verify: A Hypothesis-Verification Multi-Agent Framework for Long Video Understanding

Zheng Wang; Haoran Chen; Haoxuan Qin; Zhipeng Wei; Tianwen Qian; Cong Bai

arXiv:2603.04977 · cs.CV · March 6, 2026

TL;DR

This paper introduces VideoHV-Agent, a hypothesis-verification framework for long video question answering that improves accuracy, interpretability, and efficiency by emphasizing deliberate task formulation before evidence retrieval.

Contribution

The paper proposes a novel structured hypothesis-verification approach for long video understanding, emphasizing reasoning before retrieval to reduce errors and improve interpretability.

Findings

01 Achieves state-of-the-art accuracy on three benchmarks.

02 Enhances interpretability and logical soundness.

03 Reduces computational cost compared to previous methods.

Abstract

Long video understanding is challenging due to dense visual redundancy, long-range temporal dependencies, and the tendency of chain-of-thought and retrieval-based agents to accumulate semantic drift and correlation-driven errors. We argue that long-video reasoning should begin not with reactive retrieval, but with deliberate task formulation: the model must first articulate what must be true in the video for each candidate answer to hold. This thinking-before-finding principle motivates VideoHV-Agent, a framework that reformulates video question answering as a structured hypothesis-verification process. Based on video summaries, a Thinker rewrites answer candidates into testable hypotheses, a Judge derives a discriminative clue specifying what evidence must be checked, a Verifier grounds and tests the clue using localized, fine-grained video content, and an Answer agent integrates…
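The four roles named in the abstract can be pictured as a short pipeline. The following is only a toy sketch under stated assumptions, not the authors' implementation: every function name, signature, and data structure is hypothetical, the agents are replaced by simple string operations instead of language or vision models, and the conditioning on video summaries is omitted. It shows only the order of operations: hypotheses first, then a discriminative clue, then verification against localized content, then the final answer.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Hypothesis:
    candidate: str   # one answer option
    statement: str   # what must be true in the video for that option to hold


def thinker(question: str, candidates: List[str]) -> List[Hypothesis]:
    """Hypothetical Thinker: rewrite each candidate as a testable statement."""
    return [
        Hypothesis(c, f"For '{c}' to answer '{question}', the video must show: {c}.")
        for c in candidates
    ]


def judge(hypotheses: List[Hypothesis]) -> Dict[str, Set[str]]:
    """Hypothetical Judge: a toy 'discriminative clue', here the words of each
    candidate that no other candidate shares."""
    words = {h.candidate: set(h.candidate.lower().split()) for h in hypotheses}
    clue = {}
    for cand, own in words.items():
        others = set().union(*(w for c, w in words.items() if c != cand))
        clue[cand] = own - others
    return clue


def verifier(clue: Dict[str, Set[str]], segment_captions: List[str]) -> Dict[str, int]:
    """Hypothetical Verifier: count how many clue words appear in the captions
    of the localized segments (a stand-in for fine-grained video content)."""
    seen = set(" ".join(segment_captions).lower().split())
    return {cand: len(words & seen) for cand, words in clue.items()}


def answer_agent(support: Dict[str, int]) -> str:
    """Hypothetical Answer agent: pick the candidate with the most support."""
    return max(support, key=support.get)


def answer_question(question: str, candidates: List[str], segment_captions: List[str]) -> str:
    hypotheses = thinker(question, candidates)
    clue = judge(hypotheses)
    support = verifier(clue, segment_captions)
    return answer_agent(support)


if __name__ == "__main__":
    captions = ["a person chops onions on a board", "the person stirs a pot"]
    print(answer_question("What does the person do first?",
                          ["chops onions", "washes dishes"], captions))
```

In this toy version the clue is derived from the answer candidates before any video content is consulted, which is the thinking-before-finding ordering the abstract describes. A real system would presumably replace each stub with a model call.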

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Topic Modeling