See What You Need: Query-Aware Visual Intelligence through Reasoning-Perception Loops
Zixuan Dong, Baoyun Peng, Yufei Wang, Lin Liu, Xinxin Dong, Yunlong Cao, Xiaodong Wang

TL;DR
CAVIA is a training-free framework that enhances video question answering by dynamically coordinating reasoning and perception, leading to state-of-the-art results on multiple benchmarks.
Contribution
It introduces a novel closed-loop system that aligns reasoning with visual extraction, enabling adaptive and efficient video understanding without additional training.
Findings
Achieves state-of-the-art on EgoSchema, NExT-QA, and IntentQA benchmarks.
Demonstrates the effectiveness of reasoning-perception coordination in video understanding.
Outperforms existing methods by significant margins.
Abstract
Human video comprehension demonstrates dynamic coordination between reasoning and visual attention, adaptively focusing on query-relevant details. However, current long-form video question answering systems employ rigid pipelines that decouple reasoning from perception, leading to either information loss through premature visual abstraction or computational inefficiency through exhaustive processing. The core limitation lies in the inability to adapt visual extraction to specific reasoning requirements: different queries demand fundamentally different visual evidence from the same video content. In this work, we present CAVIA, a training-free framework that revolutionizes video understanding through reasoning-perception coordination. Unlike conventional approaches where visual processing operates independently of reasoning, CAVIA creates a closed-loop system where reasoning…
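The closed-loop idea described above can be illustrated with a toy sketch. This is not CAVIA's implementation: the abstract shown here is truncated and gives no algorithmic details, so every name below (`LoopState`, `perceive`, `reason`, `answer_query`) is hypothetical, the "video" is a list of frame captions, and simple keyword matching stands in for both the reasoning model and the visual extractor. The only point is the control flow: the reasoning step decides which frames to look at next, and the perception step returns only evidence relevant to the query.

```python
from dataclasses import dataclass, field


@dataclass
class LoopState:
    """Hypothetical container for what the loop has gathered so far."""
    question: str
    evidence: list = field(default_factory=list)   # (frame_index, caption) pairs
    inspected: set = field(default_factory=set)    # frame indices already examined


def perceive(video, frame_ids, focus):
    """Toy perception: keep only the requested frames whose caption mentions `focus`."""
    return [(i, video[i]) for i in frame_ids if focus in video[i]]


def reason(state, num_frames, window):
    """Toy reasoning: answer if evidence exists, otherwise request unseen frames."""
    if state.evidence:
        return state.evidence[0][1], []
    remaining = [i for i in range(num_frames) if i not in state.inspected]
    return None, remaining[:window]


def answer_query(video, question, focus, window=2, max_rounds=10):
    """Alternate reasoning and perception until an answer is found or frames run out."""
    state = LoopState(question)
    for _ in range(max_rounds):
        answer, request = reason(state, len(video), window)
        if answer is not None:
            return answer, sorted(state.inspected)
        if not request:
            break  # nothing left to inspect
        state.evidence.extend(perceive(video, request, focus))
        state.inspected.update(request)
    return None, sorted(state.inspected)


video = [
    "person enters kitchen",
    "person opens fridge",
    "person pours milk",
    "person drinks milk",
    "person leaves",
]
result = answer_query(video, "What does the person pour?", "pours")
```

In this toy run the loop stops once relevant evidence is found, so the final frame is never inspected. That mirrors, in a very simplified way, the contrast the abstract draws with exhaustive processing.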
