m2sv: A Scalable Benchmark for Map-to-Street-View Spatial Reasoning
Yosub Shin, Michael Buriek, Igor Molybog

TL;DR
This paper introduces m2sv, a scalable benchmark that evaluates spatial reasoning in vision-language models by asking them to align a north-up overhead map with a Street View image. The best evaluated VLM reaches only 65.2% accuracy, well below the 95% human baseline.
Contribution
The paper presents m2sv-20k, a geographically diverse benchmark with controlled ambiguity, and m2sv-sft-11k, a curated set of structured reasoning traces for supervised fine-tuning, to evaluate and improve spatial reasoning in multimodal models.
Findings
Best VLM achieves only 65.2% accuracy on m2sv
Supervised fine-tuning improves performance but transfer remains limited
Analysis reveals gaps in geometric alignment and reasoning consistency
Abstract
Vision-language models (VLMs) achieve strong performance on many multimodal benchmarks but remain brittle on spatial reasoning tasks that require aligning abstract overhead representations with egocentric views. We introduce m2sv, a scalable benchmark for map-to-street-view spatial reasoning that asks models to infer camera viewing direction by aligning a north-up overhead map with a Street View image captured at the same real-world intersection. We release m2sv-20k, a geographically diverse benchmark with controlled ambiguity, along with m2sv-sft-11k, a curated set of structured reasoning traces for supervised fine-tuning. Despite strong performance on existing multimodal benchmarks, the best evaluated VLM achieves only 65.2% accuracy on m2sv, far below the human baseline of 95%. While supervised fine-tuning and reinforcement learning yield consistent gains, cross-benchmark…
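As a rough illustration of the task described in the abstract, the sketch below scores a viewing-direction prediction. It is not the benchmark's evaluation code: the multiple-choice format, the option labels, and the helper names `angular_distance`, `closest_option`, and `accuracy` are all assumptions made for this example. Bearings are taken as degrees clockwise from north, which matches a north-up map.

```python
def angular_distance(a: float, b: float) -> float:
    """Smallest absolute difference between two compass bearings, in degrees."""
    diff = abs(a - b) % 360.0
    return min(diff, 360.0 - diff)


def closest_option(heading: float, options: dict[str, float]) -> str:
    """Return the label of the candidate bearing nearest to the camera heading.

    `options` maps a hypothetical answer label to a bearing in degrees,
    for example the directions of the roads leaving an intersection.
    """
    return min(options, key=lambda label: angular_distance(heading, options[label]))


def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of predicted labels that exactly match the reference labels."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and equal in length")
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


if __name__ == "__main__":
    # Hypothetical four-way intersection with roads toward N, E, S, and W.
    roads = {"A": 0.0, "B": 90.0, "C": 180.0, "D": 270.0}
    print(closest_option(350.0, roads))           # camera facing slightly west of north
    print(accuracy(["A", "B", "C"], ["A", "C", "C"]))
```

Wrapping the difference modulo 360 matters here because headings near north (for example 350° and 10°) are close on the compass even though their raw difference is large.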
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Robotics and Sensor-Based Localization
