m2sv: A Scalable Benchmark for Map-to-Street-View Spatial Reasoning

Yosub Shin; Michael Buriek; Igor Molybog

arXiv:2601.19099·cs.CV·January 28, 2026

Open Access

TL;DR

This paper introduces m2sv, a scalable benchmark that evaluates spatial reasoning in vision-language models by asking them to infer camera viewing direction from a north-up overhead map and a Street View image of the same intersection. The best evaluated VLM reaches 65.2% accuracy, far below the 95% human baseline.

Contribution

The paper releases m2sv-20k, a geographically diverse benchmark with controlled ambiguity, and m2sv-sft-11k, a curated set of structured reasoning traces for supervised fine-tuning, to evaluate and improve spatial reasoning in multimodal models.

Findings

01 Best VLM achieves only 65.2% accuracy on m2sv

02 Supervised fine-tuning improves performance but transfer remains limited

03 Analysis reveals gaps in geometric alignment and reasoning consistency

Abstract

Vision-language models (VLMs) achieve strong performance on many multimodal benchmarks but remain brittle on spatial reasoning tasks that require aligning abstract overhead representations with egocentric views. We introduce m2sv, a scalable benchmark for map-to-street-view spatial reasoning that asks models to infer camera viewing direction by aligning a north-up overhead map with a Street View image captured at the same real-world intersection. We release m2sv-20k, a geographically diverse benchmark with controlled ambiguity, along with m2sv-sft-11k, a curated set of structured reasoning traces for supervised fine-tuning. Despite strong performance on existing multimodal benchmarks, the best evaluated VLM achieves only 65.2% accuracy on m2sv, far below the human baseline of 95%. While supervised fine-tuning and reinforcement learning yield consistent gains, cross-benchmark…
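
To make the task concrete, here is a minimal sketch of how a viewing-direction item could be scored. The excerpt does not describe the item format, so this assumes a multiple-choice setup in which each option is a compass bearing on the north-up map (0 = north, 90 = east). The names angular_diff, nearest_option, and accuracy are hypothetical and are not taken from the m2sv release.

```python
def angular_diff(a, b):
    """Smallest absolute difference between two compass bearings, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def nearest_option(heading, options):
    """Return the label of the option whose bearing is closest to the camera heading.

    `options` maps a label to a bearing in degrees on a north-up map.
    This labeling scheme is an assumption made for illustration.
    """
    return min(options, key=lambda label: angular_diff(heading, options[label]))


def accuracy(predictions, answers):
    """Fraction of items where the predicted label matches the reference label."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty set of items")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


# Hypothetical four-way intersection: one option per road arm.
options = {"A": 0.0, "B": 90.0, "C": 180.0, "D": 270.0}
gold = nearest_option(350.0, options)  # camera facing roughly north
```

The circular difference matters because bearings wrap around: a camera heading of 350 degrees is 10 degrees from north, not 350.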

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Robotics and Sensor-Based Localization