MMLANDMARKS: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding
Oskar Kristoffersen, Alba Reinders Sánchez, Morten Rieger Hannemose, Anders Bjorholm Dahl, Dim P. Papadopoulos

TL;DR
The paper introduces MMLandmarks, a multimodal dataset for geo-spatial understanding that supports tasks such as cross-view retrieval and geolocalization. It exposes the limitations of existing specialized models and shows the potential of multimodal data.
Contribution
It provides a large, multi-modal benchmark dataset with aligned data for landmarks, facilitating training and evaluation of models across various geo-spatial tasks.
Findings
Current models struggle with multi-modal geo-spatial tasks.
A simple CLIP-inspired baseline demonstrates versatility on MMLandmarks.
The dataset enables broad generalization in geo-spatial understanding.
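As an illustration of what a "CLIP-inspired" baseline for cross-view retrieval might look like, the sketch below computes a symmetric contrastive loss over paired ground-view and aerial embeddings and ranks a gallery by cosine similarity. This is not the authors' implementation: the function names, the temperature of 0.07, and the toy two-dimensional embeddings are all assumptions made for the example, and the summary here does not specify the actual loss or encoders.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(ground, aerial, temperature=0.07):
    """CLIP-style loss: row i of `ground` is assumed to match row i of `aerial`.

    The temperature value is an assumption, not taken from the paper.
    """
    g = [l2_normalize(v) for v in ground]
    a = [l2_normalize(v) for v in aerial]
    logits = [
        [sum(x * y for x, y in zip(gi, aj)) / temperature for aj in a]
        for gi in g
    ]
    logits_t = [list(col) for col in zip(*logits)]  # aerial-to-ground direction
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(logits_t))

def retrieve(query, gallery):
    """Return gallery indices sorted from most to least similar to the query."""
    q = l2_normalize(query)
    sims = [sum(x * y for x, y in zip(q, l2_normalize(g))) for g in gallery]
    return sorted(range(len(gallery)), key=lambda j: -sims[j])

# Toy embeddings (hypothetical): two landmarks, one vector per view.
ground = [[1.0, 0.0], [0.0, 1.0]]
aerial = [[0.9, 0.1], [0.1, 0.9]]

loss_aligned = symmetric_contrastive_loss(ground, aerial)
loss_swapped = symmetric_contrastive_loss(ground, aerial[::-1])
ranking = retrieve(ground[0], aerial)
```

With correctly paired views the loss is lower than with the pairs swapped, and the ground-view query for the first landmark ranks its own aerial embedding first. The same pattern would extend to text or coordinate embeddings by swapping in a different second modality.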
Abstract
Geo-spatial analysis of our world benefits from a multimodal approach, as every single geographic location can be described in numerous ways (images from various viewpoints, textual descriptions, geographic coordinates, etc.). Current benchmarks have limited coverage across modalities, leading to specialized models that perform well in their respective domains, but do not fully take advantage of other geo-spatial modalities. We introduce the Multi-Modal Landmark dataset (MMLandmarks), a benchmark composed of four modalities: 197k high-resolution aerial images, 329k ground-view images, textual information, and geographic coordinates for 18,557 distinct landmarks in the United States. The MMLandmarks dataset has a one-to-one landmark-level correspondence across every modality, which enables training and benchmarking models for various geo-spatial tasks, including cross-view…
