Perception-Aware Multimodal Spatial Reasoning from Monocular Images

Yanchun Cheng; Rundong Wang; Xulei Yang; Alok Prakash; Daniela Rus; Marcelo H Ang Jr; ShiJie Li

arXiv:2603.06985 · cs.CV · March 10, 2026

TL;DR

This paper introduces a perception-aware multimodal reasoning framework that enhances vision-language models with explicit object grounding and a new reasoning dataset, significantly improving spatial understanding in monocular driving scenarios.

Contribution

The paper proposes an object-centric grounding method and a Multimodal Chain-of-Thought (MM-CoT) dataset to improve geometric perception and reasoning in vision-language models.

Findings

01. Outperforms previous methods on the SURDS benchmark

02. Achieves large gains in single-object and multi-object spatial reasoning tasks

03. Demonstrates the mutual reinforcement of perception and reasoning in monocular vision

Abstract

Spatial reasoning from monocular images is essential for autonomous driving, yet current Vision-Language Models (VLMs) still struggle with fine-grained geometric perception, particularly under large scale variation and ambiguous object appearance. We propose a simple yet effective perception-aware multimodal reasoning framework that equips VLMs with explicit object-centric grounding ability. Instead of relying on textual bounding-box outputs, each referred object is represented using all Visual Reference Tokens (VRTs) within its spatial extent, enabling visual evidence and textual reasoning to be processed jointly in a unified token space. To further strengthen cross-modal interaction, we construct a Multimodal Chain-of-Thought (MM-CoT) dataset that injects aligned visual and textual reasoning signals. A deterministic ordering strategy is introduced to make supervision over inherently…
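The idea of representing a referred object by all visual tokens inside its spatial extent can be illustrated with a minimal sketch. It assumes a ViT-style encoder that splits the image into a regular grid of square patches, one token per patch, indexed in row-major order. The function name `tokens_in_box`, the pixel-coordinate box format, and the patch grid are hypothetical choices made for illustration; the abstract does not specify how the authors map objects to Visual Reference Tokens.

```python
import math
from typing import List, Tuple


def tokens_in_box(
    box: Tuple[float, float, float, float],
    image_size: Tuple[int, int],
    patch_size: int,
) -> List[int]:
    """Return row-major indices of patch tokens that overlap a box.

    box is (x0, y0, x1, y1) in pixels, image_size is (width, height).
    Assumption: one token per patch on a regular grid; this is an
    illustrative mapping, not the paper's confirmed procedure.
    """
    x0, y0, x1, y1 = box
    width, height = image_size
    cols = width // patch_size
    rows = height // patch_size

    # A box with no area covers no tokens.
    if x1 <= x0 or y1 <= y0:
        return []

    # Convert pixel bounds to patch bounds (end is exclusive), clamped to the grid.
    col_start = max(0, math.floor(x0 / patch_size))
    col_end = min(cols, math.ceil(x1 / patch_size))
    row_start = max(0, math.floor(y0 / patch_size))
    row_end = min(rows, math.ceil(y1 / patch_size))

    # Collect every token whose patch intersects the box, in row-major order.
    return [
        r * cols + c
        for r in range(row_start, row_end)
        for c in range(col_start, col_end)
    ]


if __name__ == "__main__":
    # 64x64 image, 16-pixel patches: a 4x4 grid of 16 tokens.
    print(tokens_in_box((16, 16, 48, 32), (64, 64), 16))
```

In this sketch a small, distant object maps to one or two tokens while a nearby object maps to many, which is one way to see why token-level grounding carries scale information that a textual bounding box does not.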

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Visual Attention and Saliency Detection