EagleVision: A Dual-Stage Framework with BEV-grounding-based Chain-of-Thought for Spatial Intelligence
Jiaxu Wan, Xu Wang, Mengwei Xie, Hang Zhang, Mu Xu, Yang Han, Hong Zhang, Ding Yuan, Yifan Yang

TL;DR
EagleVision introduces a dual-stage framework that combines geometry-aware frame selection with active BEV-grounded reasoning, significantly improving spatial understanding in video-based reasoning tasks.
Contribution
The paper proposes a dual-stage approach that pairs keyframe selection, balancing semantic relevance against viewpoint diversity, with iterative BEV-grounded reasoning trained via reinforcement learning.
Findings
Achieves state-of-the-art results on VSI-Bench and SQA3D datasets.
Effectively combines macro perception and micro verification for spatial reasoning.
Outperforms existing open-source vision-language models in spatial tasks.
Abstract
Video-based spatial reasoning -- such as estimating distances, judging directions, or understanding layouts from multiple views -- requires selecting informative frames and, when needed, actively seeking additional viewpoints during inference. Existing multimodal large language models (MLLMs) consume a fixed set of uniformly sampled frames and cannot request new views once reasoning begins, often missing the geometric cues necessary for reliable spatial judgments. We present EagleVision, a dual-stage framework that combines geometry-aware frame selection with active, Bird's-Eye-View (BEV)-grounded reasoning. In the first stage (macro perception), a semantics-perspective-fusion determinantal point process (SPF-DPP) selects a compact set of keyframes that jointly maximize semantic relevance and viewpoint diversity under a fixed token budget. In the second stage (micro verification), the…
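The abstract describes the first stage as a determinantal point process (DPP) that picks keyframes for both semantic relevance and viewpoint diversity. As a rough illustration of that general idea, and not the authors' SPF-DPP, the sketch below runs a naive greedy search over a standard quality-diversity DPP kernel. The function name `greedy_dpp_select`, the cosine-similarity kernel, and the toy relevance scores and view features are all assumptions made for illustration.

```python
import numpy as np


def greedy_dpp_select(relevance, features, k):
    """Greedily pick up to k items that approximately maximize det(L_S).

    relevance: per-frame quality scores (hypothetical stand-in for semantic relevance).
    features:  per-frame vectors (hypothetical stand-in for viewpoint descriptors).
    """
    relevance = np.asarray(relevance, dtype=float)
    features = np.asarray(features, dtype=float)

    # Normalize rows so the similarity matrix holds cosine similarities.
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    similarity = unit @ unit.T

    # Quality-diversity kernel: L_ij = q_i * S_ij * q_j.
    kernel = relevance[:, None] * similarity * relevance[None, :]

    selected = []
    for _ in range(min(k, len(relevance))):
        best_idx, best_logdet = None, -np.inf
        for i in range(len(relevance)):
            if i in selected:
                continue
            idx = selected + [i]
            # log det of the kernel restricted to the candidate subset.
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best_idx, best_logdet = i, logdet
        if best_idx is None:
            break  # every remaining candidate makes the subset degenerate
        selected.append(best_idx)
    return selected


# Toy example: frames 0 and 1 are near-duplicate views, frame 2 is a different view.
scores = [1.0, 0.9, 0.8]
views = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
print(greedy_dpp_select(scores, views, k=2))  # prints [0, 2]
```

In the toy example, frame 1 scores higher than frame 2 on relevance, yet the determinant penalizes its near-duplicate viewpoint, so the more distinct frame 2 is chosen. A fixed `k` stands in for the token budget here; how the paper maps the budget to a frame count is not stated in this excerpt.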
Taxonomy
Topics: Multimodal Machine Learning Applications · Spatial Cognition and Navigation · Face Recognition and Perception
