Perception-Aware Multimodal Spatial Reasoning from Monocular Images
Yanchun Cheng, Rundong Wang, Xulei Yang, Alok Prakash, Daniela Rus, Marcelo H Ang Jr, ShiJie Li

TL;DR
This paper introduces a perception-aware multimodal reasoning framework that enhances vision-language models with explicit object grounding and a new reasoning dataset, significantly improving spatial understanding in monocular driving scenarios.
Contribution
It proposes a novel object-centric grounding method and a Multimodal Chain-of-Thought dataset to improve geometric perception and reasoning in vision-language models.
Findings
Outperforms previous methods on the SURDS benchmark
Achieves large gains in single-object and multi-object spatial reasoning tasks
Demonstrates the mutual reinforcement of perception and reasoning in monocular vision
Abstract
Spatial reasoning from monocular images is essential for autonomous driving, yet current Vision-Language Models (VLMs) still struggle with fine-grained geometric perception, particularly under large scale variation and ambiguous object appearance. We propose a simple yet effective perception-aware multimodal reasoning framework that equips VLMs with explicit object-centric grounding ability. Instead of relying on textual bounding-box outputs, each referred object is represented using all Visual Reference Tokens (VRTs) within its spatial extent, enabling visual evidence and textual reasoning to be processed jointly in a unified token space. To further strengthen cross-modal interaction, we construct a Multimodal Chain-of-Thought (MM-CoT) dataset that injects aligned visual and textual reasoning signals. A deterministic ordering strategy is introduced to make supervision over inherently…
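The abstract's idea of representing an object by every visual token inside its spatial extent can be illustrated with a small sketch. This is not the authors' implementation: the function name `tokens_in_box`, the 14-pixel patch size, the row-major token layout, and the pixel-space box format `(x0, y0, x1, y1)` are all assumptions made for illustration. The sketch only shows how a box might be mapped to the indices of the patch tokens it overlaps.

```python
from math import ceil, floor


def tokens_in_box(box, image_size, patch=14):
    """Return row-major indices of patch tokens that overlap a pixel box.

    Hypothetical helper: patch size, grid layout, and box format are
    assumptions, not details taken from the paper.
    """
    x0, y0, x1, y1 = box
    width, height = image_size
    cols = width // patch   # number of patch columns in the token grid
    rows = height // patch  # number of patch rows in the token grid

    # Convert pixel coordinates to grid cells, clamped to the grid bounds.
    c0 = max(0, floor(x0 / patch))
    r0 = max(0, floor(y0 / patch))
    c1 = min(cols, ceil(x1 / patch))
    r1 = min(rows, ceil(y1 / patch))

    # Enumerate the covered cells top-to-bottom, left-to-right.
    return [r * cols + c for r in range(r0, r1) for c in range(c0, c1)]


if __name__ == "__main__":
    # A 56x56 image with 14-pixel patches gives a 4x4 token grid.
    print(tokens_in_box((14, 14, 42, 28), (56, 56)))
```

In this sketch the indices come out in raster order simply because of how the loop is written; whether that matches the ordering strategy mentioned in the abstract is not stated in the text available here.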
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Visual Attention and Saliency Detection
