Geo3DVQA: Evaluating Vision-Language Models for 3D Geospatial Reasoning from Aerial Imagery

Mai Tsujimoto; Junjue Wang; Weihao Xuan; Naoto Yokoya

arXiv:2512.07276 · cs.CV · December 23, 2025

Open Access

TL;DR

Geo3DVQA introduces a new benchmark for evaluating vision-language models on 3D geospatial reasoning from aerial RGB imagery, highlighting current limitations and the impact of domain-specific tuning.

Contribution

The paper presents Geo3DVQA, a comprehensive benchmark for RGB-based 3D geospatial reasoning, and systematically evaluates state-of-the-art models, revealing both their limitations and the gains achieved through domain-specific instruction tuning.

Findings

01. RGB models show fundamental limitations in 3D spatial reasoning.

02. Domain-specific instruction tuning improves model performance.

03. Benchmark includes 110k questions across 16 task categories.

Abstract

Three-dimensional geospatial analysis is critical for applications in urban planning, climate adaptation, and environmental assessment. However, current methodologies depend on costly, specialized sensors, such as LiDAR and multispectral sensors, which restrict global accessibility. Additionally, existing sensor-based and rule-driven methods struggle with tasks requiring the integration of multiple 3D cues, handling diverse queries, and providing interpretable reasoning. We present Geo3DVQA, a comprehensive benchmark that evaluates vision-language models (VLMs) in height-aware 3D geospatial reasoning from RGB imagery alone. Unlike conventional sensor-based frameworks, Geo3DVQA emphasizes realistic scenarios integrating elevation, sky view factors, and land cover patterns. The benchmark comprises 110k curated question-answer pairs across 16 task categories, including single-feature…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Remote Sensing and LiDAR Applications