Built Environment Reasoning from Remote Sensing Imagery Using Large Vision-Language Models

Dongdong Wang; Deepak Balakrishnan; Ravi Srinivasan; Shenhao Wang

arXiv:2605.08404 · cs.CL · May 12, 2026

TL;DR

This paper explores integrating remote sensing imagery with large language models to enhance reasoning about built environments for smart city applications, evaluating multiple models and scales.

Contribution

It introduces a multimodal approach combining remote sensing imagery with LLMs for built environment reasoning, comparing state-of-the-art models.

Findings

01. Remote sensing imagery improves LLM-based built environment reasoning.

02. InternVL and Qwen outperform other models in accuracy and reliability.

03. Multiscale imagery input enhances decision-making capabilities.

Abstract

This work investigates the use of large language models (LLMs) for tasks in smart cities. The core idea is to leverage remote sensing imagery to characterize the built environment in support of tasks such as design suggestion, constructability assessment, land-use pattern analysis, and risk identification. We examine remote sensing imagery at multiple spatial scales as inputs for multimodal language modeling and evaluate the effect of scale on built-environment-related reasoning. In addition, we compare state-of-the-art LLMs, including InternVL and Qwen, in terms of accuracy and reliability when generating built environment recommendations. The results demonstrate the potential of integrating remote sensing imagery with large language models to support smart cities and their decision-making.
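
To make the idea of multi-scale inputs concrete, here is a minimal sketch of one way to derive nested crop windows around a site of interest before passing each crop to a vision-language model. It is not taken from the paper: the names `CropBox` and `multiscale_boxes`, the square-crop choice, and the pixel sizes are all assumptions made for illustration, and the authors' actual preprocessing may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CropBox:
    """Pixel window in (left, top, right, bottom) order. Hypothetical helper type."""
    left: int
    top: int
    right: int
    bottom: int


def multiscale_boxes(width, height, cx, cy, sizes):
    """Return one square crop window per requested side length.

    Each window is centered on (cx, cy) where possible, then shifted so it
    stays inside a width x height image. A side length larger than the image
    is reduced to the largest square that fits.
    """
    boxes = []
    for size in sizes:
        side = min(size, width, height)
        # Clamp the top-left corner so the window never leaves the image.
        left = min(max(cx - side // 2, 0), width - side)
        top = min(max(cy - side // 2, 0), height - side)
        boxes.append(CropBox(left, top, left + side, top + side))
    return boxes


# Assumed example: a 1000 x 800 tile with the site at its center,
# viewed at three illustrative scales (parcel, block, neighborhood).
boxes = multiscale_boxes(1000, 800, 500, 400, [200, 400, 2000])
for box in boxes:
    print(box)
```

The windows use the same (left, upper, right, lower) ordering that Pillow's `Image.crop` expects, so each one could be applied to a loaded tile and the resulting crops supplied to the model alongside a task prompt.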
