RoadBench: Benchmarking MLLMs on Fine-Grained Spatial Understanding and Reasoning under Urban Road Scenarios

Jun Zhang; Jie Feng; Long Chen; Junhui Wang; Zhicheng Liu; Depeng Jin; Yong Li

arXiv:2511.18011·cs.CV·November 25, 2025

TL;DR

RoadBench is a comprehensive benchmark designed to evaluate multimodal large language models' fine-grained spatial understanding and reasoning in urban road scenarios, revealing significant gaps in current models' capabilities.

Contribution

This work introduces RoadBench, a systematic benchmark of six tasks and 9,121 manually verified test cases, built around urban road markings and traffic systems, for evaluating MLLMs' fine-grained spatial understanding and reasoning.

Findings

01

Existing MLLMs perform poorly on fine-grained urban spatial tasks.

02

Many MLLMs perform worse than simple rule-based or random baselines.

03

RoadBench exposes critical shortcomings in current models' urban spatial understanding.

Abstract

Multimodal large language models (MLLMs) have demonstrated powerful capabilities in general spatial understanding and reasoning. However, their fine-grained spatial understanding and reasoning capabilities in complex urban scenarios have received little attention in either research or industry. To fill this gap, we focus primarily on road markings as a typical example of fine-grained spatial elements in urban scenarios, given the essential role of the integrated road traffic network they form within cities. Centered on road markings and urban traffic systems, we propose RoadBench, a systematic benchmark that comprehensively evaluates MLLMs' fine-grained spatial understanding and reasoning capabilities using BEV and FPV image inputs. The benchmark comprises six tasks with 9,121 strictly manually verified test cases. These tasks form a systematic evaluation…

Taxonomy

Topics: Multimodal Machine Learning Applications · Automated Road and Building Extraction · Geographic Information Systems Studies