GeoDiT: A Diffusion-based Vision-Language Model for Geospatial Understanding

Jiaqi Liu; Ronghao Fu; Haoran Liu; Lang Sun; Bo Yang

arXiv:2512.02505 · cs.CV · March 25, 2026

TL;DR

GeoDiT introduces a diffusion-based vision-language model for geospatial understanding, enabling parallel, structured scene synthesis and achieving state-of-the-art results in object-centric tasks like captioning and detection.

Contribution

GeoDiT is presented as the first diffusion-based vision-language model tailored to the geospatial domain. It challenges the autoregressive paradigm by treating generation as parallel refinement, which improves structured scene understanding.

Findings

01. Sets a new state of the art on geospatial benchmarks that require structured, object-centric outputs

02. Achieves significant gains in image captioning and visual grounding

03. Improves multi-object detection

Abstract

Autoregressive models are structurally misaligned with the inherently parallel nature of geospatial understanding, forcing a rigid sequential narrative onto scenes and fundamentally hindering the generation of structured and coherent outputs. We challenge this paradigm by reframing geospatial generation as a parallel refinement process, enabling a holistic, coarse-to-fine synthesis that resolves all semantic elements simultaneously. To operationalize this, we introduce GeoDiT, the first diffusion-based vision-language model tailored for the geospatial domain. Extensive experiments demonstrate that GeoDiT establishes a new state-of-the-art on benchmarks requiring structured, object-centric outputs. It achieves significant gains in image captioning, visual grounding, and multi-object detection, precisely the tasks where autoregressive models falter. Our work validates that aligning the…
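The contrast between sequential decoding and parallel refinement can be illustrated with a toy sketch. This is not GeoDiT's implementation: `MASK`, `toy_predict`, and `parallel_refine` are hypothetical names, the denoiser is a stand-in that simply reveals a known target with random confidences, and the "commit the most confident slots at each step" schedule is an assumption borrowed from common masked-diffusion style decoders. It only shows the general shape of a coarse-to-fine loop in which every position is predicted at once.

```python
import random

MASK = "<mask>"  # hypothetical placeholder for an unresolved token


def toy_predict(tokens, target, rng):
    """Stand-in for a denoiser (assumption, not a real model).

    Proposes a token and a random confidence for every masked slot,
    looking at all positions in the same pass.
    """
    return {
        i: (target[i], rng.random())
        for i, tok in enumerate(tokens)
        if tok == MASK
    }


def parallel_refine(target, steps=3, seed=0):
    """Start fully masked and commit the most confident slots each step."""
    rng = random.Random(seed)
    tokens = [MASK] * len(target)
    history = [list(tokens)]
    for step in range(steps):
        proposals = toy_predict(tokens, target, rng)
        if not proposals:
            break
        # Spread the remaining masked slots evenly over the remaining steps.
        remaining_steps = steps - step
        k = -(-len(proposals) // remaining_steps)  # ceiling division
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)[:k]
        for i in best:
            tokens[i] = proposals[i][0]
        history.append(list(tokens))
    return tokens, history


if __name__ == "__main__":
    caption = "a plane parked near two hangars".split()
    final, history = parallel_refine(caption, steps=3)
    for state in history:
        print(" ".join(state))
```

Unlike a left-to-right decoder, the order in which slots are resolved here depends on confidence rather than position, so an object name late in the sentence may be fixed before the words that precede it.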


Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Geographic Information Systems Studies