SURDS: Benchmarking Spatial Understanding and Reasoning in Driving Scenarios with Vision Language Models
Xianda Guo, Ruijun Zhang, Yiqun Duan, Yuhang He, Dujun Nie, Wenke Huang, Chenming Zhang, Shuai Liu, Hao Zhao, Long Chen

TL;DR
This paper introduces SURDS, a comprehensive benchmark for evaluating spatial reasoning in vision language models within driving scenarios, and proposes reinforcement learning-based alignment to improve their performance.
Contribution
The paper presents SURDS, a large-scale benchmark for spatial reasoning in autonomous driving, and demonstrates that reinforcement learning-based alignment enhances VLMs' spatial understanding.
Findings
General-purpose VLMs, including GPT, Gemini, and Qwen, show persistent limitations in fine-grained spatial understanding on SURDS.
Reinforcement learning-based alignment improves VLMs' spatial reasoning scores.
GRPO-aligned models outperform proprietary systems such as GPT-4o and Gemini-2.0-flash.
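For readers unfamiliar with GRPO (Group Relative Policy Optimization), the sketch below shows its central step as it is generally described: several answers are sampled for the same prompt, and each answer's reward is normalized against the mean and standard deviation of its own group. This is an illustration only, not the paper's training code. The function name, the `eps` constant, the use of the population standard deviation, and the example rewards are all assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples in its group.

    rewards: scalar rewards for several answers sampled for ONE prompt.
    eps: small constant (an assumption) that avoids division by zero
         when every answer in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards: 1.0 for a correct spatial answer, 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Answers that score above their group's average receive a positive advantage and are reinforced; those below receive a negative one. When all answers in a group score the same, every advantage is zero and that group contributes no learning signal.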
Abstract
Accurate spatial reasoning in outdoor environments - covering geometry, object pose, and inter-object relationships - is fundamental to downstream tasks such as mapping, motion forecasting, and high-level planning in autonomous driving. We introduce SURDS, a large-scale benchmark designed to systematically evaluate the spatial reasoning capabilities of vision language models (VLMs). Built on the nuScenes dataset, SURDS comprises 41,080 vision-question-answer training instances and 9,250 evaluation samples, spanning six spatial categories: orientation, depth estimation, pixel-level localization, pairwise distance, lateral ordering, and front-behind relations. We benchmark leading general-purpose VLMs, including GPT, Gemini, and Qwen, revealing persistent limitations in fine-grained spatial understanding. To address these deficiencies, we go beyond static evaluation and explore whether…
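As a rough illustration of how results over the six categories named above might be tallied, the sketch below computes per-category accuracy. It assumes exact-match scoring and a hypothetical record layout with the keys `category`, `prediction`, and `answer`. The benchmark's actual data format and metrics are not given in this summary and may differ, particularly for depth estimation and pixel-level localization.

```python
from collections import defaultdict

# The six spatial categories listed in the abstract.
CATEGORIES = (
    "orientation",
    "depth estimation",
    "pixel-level localization",
    "pairwise distance",
    "lateral ordering",
    "front-behind relations",
)


def per_category_accuracy(samples):
    """Return {category: accuracy} for categories that have samples.

    samples: iterable of dicts with the hypothetical keys
             'category', 'prediction', and 'answer'.
    Scoring is exact match, which is an assumption of this sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for sample in samples:
        category = sample["category"]
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category!r}")
        total[category] += 1
        correct[category] += int(sample["prediction"] == sample["answer"])
    return {c: correct[c] / total[c] for c in CATEGORIES if total[c]}


# Made-up records for illustration only; these are not benchmark data.
demo = [
    {"category": "orientation", "prediction": "left", "answer": "left"},
    {"category": "orientation", "prediction": "right", "answer": "left"},
    {"category": "lateral ordering", "prediction": "A", "answer": "A"},
]
scores = per_category_accuracy(demo)
```

Reporting one score per category, rather than a single pooled number, keeps a weakness in one kind of spatial question from being hidden by strength in another.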
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
