EagleVision: A Dual-Stage Framework with BEV-grounding-based Chain-of-Thought for Spatial Intelligence

Jiaxu Wan; Xu Wang; Mengwei Xie; Hang Zhang; Mu Xu; Yang Han; Hong Zhang; Ding Yuan; Yifan Yang

arXiv:2512.15160 · cs.CV · March 23, 2026

TL;DR

EagleVision introduces a dual-stage framework that combines geometry-aware frame selection with active BEV-grounded reasoning, significantly improving spatial understanding in video-based reasoning tasks.

Contribution

The paper proposes a dual-stage approach that pairs semantically relevant, viewpoint-diverse keyframe selection with iterative BEV-grounded reasoning trained via reinforcement learning.

Findings

01. Achieves state-of-the-art results on VSI-Bench and SQA3D datasets.

02. Effectively combines macro perception and micro verification for spatial reasoning.

03. Outperforms existing open-source vision-language models in spatial tasks.

Abstract

Video-based spatial reasoning -- such as estimating distances, judging directions, or understanding layouts from multiple views -- requires selecting informative frames and, when needed, actively seeking additional viewpoints during inference. Existing multimodal large language models (MLLMs) consume a fixed set of uniformly sampled frames and cannot request new views once reasoning begins, often missing the geometric cues necessary for reliable spatial judgments. We present EagleVision, a dual-stage framework that combines geometry-aware frame selection with active, Bird's-Eye-View (BEV)-grounded reasoning. In the first stage (macro perception), a semantics-perspective-fusion determinantal point process (SPF-DPP) selects a compact set of keyframes that jointly maximize semantic relevance and viewpoint diversity under a fixed token budget. In the second stage (micro verification), the…
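The first-stage idea of picking a few frames that are both relevant and mutually diverse can be illustrated with a generic determinantal point process. The sketch below is not the paper's SPF-DPP: it assumes the common quality-diversity kernel L[i][j] = q_i * cos(f_i, f_j) * q_j and a greedy MAP selection, and it approximates the token budget with a fixed frame count k. The names `build_kernel`, `greedy_dpp`, `relevance`, and `features` are hypothetical.

```python
import math

def build_kernel(relevance, features):
    """Assumed quality-diversity kernel: L[i][j] = q_i * cos(f_i, f_j) * q_j.
    relevance: per-frame scores q_i > 0; features: non-zero per-frame vectors."""
    unit = []
    for f in features:
        norm = math.sqrt(sum(x * x for x in f))
        unit.append([x / norm for x in f])
    n = len(unit)
    return [
        [relevance[i] * sum(a * b for a, b in zip(unit[i], unit[j])) * relevance[j]
         for j in range(n)]
        for i in range(n)
    ]

def greedy_dpp(L, k, eps=1e-10):
    """Greedy MAP selection: repeatedly add the frame with the largest
    marginal gain in det(L restricted to the selected frames)."""
    n = len(L)
    gain = [L[i][i] for i in range(n)]  # marginal gain of adding each frame
    chol = [[] for _ in range(n)]       # incremental Cholesky rows
    selected = []
    remaining = set(range(n))
    while len(selected) < k and remaining:
        j = max(remaining, key=lambda i: gain[i])
        if gain[j] <= eps:              # nothing left adds new information
            break
        selected.append(j)
        remaining.remove(j)
        d_j = math.sqrt(gain[j])
        for i in remaining:
            # Project out the part of frame i already explained by frame j.
            dot = sum(a * b for a, b in zip(chol[j], chol[i]))
            e = (L[j][i] - dot) / d_j
            chol[i].append(e)
            gain[i] -= e * e
    return selected

# Frames 0 and 1 look almost alike; frame 2 shows a different viewpoint.
kernel = build_kernel([1.0, 0.9, 0.8], [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]])
picked = greedy_dpp(kernel, k=2)
```

In this toy input the near-duplicate frame 1 loses almost all of its marginal gain once frame 0 is chosen, so the less relevant but differently oriented frame 2 is preferred. A real pipeline would replace the frame count with a token budget and would need a kernel that fuses semantic and viewpoint cues as the paper describes.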

Taxonomy

Topics: Multimodal Machine Learning Applications · Spatial Cognition and Navigation · Face Recognition and Perception