SURDS: Benchmarking Spatial Understanding and Reasoning in Driving Scenarios with Vision Language Models

Xianda Guo; Ruijun Zhang; Yiqun Duan; Yuhang He; Dujun Nie; Wenke Huang; Chenming Zhang; Shuai Liu; Hao Zhao; Long Chen

arXiv:2411.13112·cs.CV·May 28, 2025

TL;DR

This paper introduces SURDS, a comprehensive benchmark for evaluating spatial reasoning in vision language models within driving scenarios, and proposes reinforcement learning-based alignment to improve their performance.

Contribution

The paper presents SURDS, a large-scale benchmark built on nuScenes for spatial reasoning in autonomous driving (41,080 training instances and 9,250 evaluation samples across six spatial categories), and shows that reinforcement learning-based alignment improves VLMs' spatial understanding.

Findings

01. VLMs show limited spatial reasoning capabilities on SURDS.

02. Reinforcement learning-based alignment improves VLMs' spatial reasoning scores.

03. GRPO-aligned models outperform proprietary systems like GPT-4o and Gemini-2.0-flash.
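
For readers unfamiliar with GRPO (Group Relative Policy Optimization), the core idea is that each sampled answer is scored relative to the other answers sampled for the same prompt, rather than by a separate value model. The following is a minimal sketch of that normalization step only, not the paper's training code. The function name, the `eps` constant, the use of the population standard deviation, and the example rewards are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    `rewards` holds one scalar reward per answer sampled for a single prompt.
    Using the population standard deviation and adding `eps` to avoid division
    by zero are illustrative choices; real implementations may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four answers sampled for one spatial question
# (1.0 = judged correct, 0.0 = judged incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Answers that beat their group's average receive a positive advantage and the rest a negative one; when every answer in a group earns the same reward, all advantages are zero and that group contributes no learning signal.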

Abstract

Accurate spatial reasoning in outdoor environments - covering geometry, object pose, and inter-object relationships - is fundamental to downstream tasks such as mapping, motion forecasting, and high-level planning in autonomous driving. We introduce SURDS, a large-scale benchmark designed to systematically evaluate the spatial reasoning capabilities of vision language models (VLMs). Built on the nuScenes dataset, SURDS comprises 41,080 vision-question-answer training instances and 9,250 evaluation samples, spanning six spatial categories: orientation, depth estimation, pixel-level localization, pairwise distance, lateral ordering, and front-behind relations. We benchmark leading general-purpose VLMs, including GPT, Gemini, and Qwen, revealing persistent limitations in fine-grained spatial understanding. To address these deficiencies, we go beyond static evaluation and explore whether…
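
As a rough illustration of how results on a benchmark organized into six categories might be tallied, the sketch below computes per-category accuracy from a list of records. It is not the SURDS evaluation code: the record fields (`category`, `prediction`, `answer`), the category identifiers, and the use of exact-match scoring are all assumptions, and tasks such as depth estimation or pixel-level localization would more plausibly be scored with a tolerance or distance-based metric.

```python
from collections import defaultdict

# Hypothetical identifiers for the six spatial categories named in the abstract.
CATEGORIES = (
    "orientation",
    "depth_estimation",
    "pixel_localization",
    "pairwise_distance",
    "lateral_ordering",
    "front_behind",
)


def per_category_accuracy(records):
    """Return {category: fraction of exact-match predictions} for the given records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["category"]] += 1
        correct[rec["category"]] += int(rec["prediction"] == rec["answer"])
    return {cat: correct[cat] / total[cat] for cat in total}


# Made-up records for illustration only; these are not benchmark data.
records = [
    {"category": "lateral_ordering", "prediction": "left", "answer": "left"},
    {"category": "lateral_ordering", "prediction": "right", "answer": "left"},
    {"category": "front_behind", "prediction": "front", "answer": "front"},
]
scores = per_category_accuracy(records)
```

Reporting scores per category rather than as a single average makes it easier to see which kinds of spatial question a model handles poorly.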

Code & Models

Repositories

xiandaguo/drive-mllm
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications