GeoDiT: A Diffusion-based Vision-Language Model for Geospatial Understanding
Jiaqi Liu, Ronghao Fu, Haoran Liu, Lang Sun, Bo Yang

TL;DR
GeoDiT introduces a diffusion-based vision-language model for geospatial understanding, enabling parallel, structured scene synthesis and achieving state-of-the-art results in object-centric tasks like captioning and detection.
Contribution
GeoDiT is the first diffusion-based vision-language model tailored for the geospatial domain; it challenges the autoregressive paradigm and improves structured scene understanding.
Findings
Sets a new state of the art on geospatial benchmarks that require structured, object-centric outputs
Achieves significant gains in image captioning and visual grounding
Improves multi-object detection
Abstract
Autoregressive models are structurally misaligned with the inherently parallel nature of geospatial understanding, forcing a rigid sequential narrative onto scenes and fundamentally hindering the generation of structured and coherent outputs. We challenge this paradigm by reframing geospatial generation as a parallel refinement process, enabling a holistic, coarse-to-fine synthesis that resolves all semantic elements simultaneously. To operationalize this, we introduce GeoDiT, the first diffusion-based vision-language model tailored for the geospatial domain. Extensive experiments demonstrate that GeoDiT establishes a new state-of-the-art on benchmarks requiring structured, object-centric outputs. It achieves significant gains in image captioning, visual grounding, and multi-object detection, precisely the tasks where autoregressive models falter. Our work validates that aligning the…
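To make the idea of "parallel refinement" concrete, here is a minimal toy sketch. It is not GeoDiT's method: the abstract does not describe the decoding procedure, so this uses a generic masked-token refinement loop as a stand-in. The names `toy_denoiser` and `parallel_refine`, the made-up confidence rule, and the linear unmasking schedule are all assumptions for illustration only. The point it shows is that every position is predicted at every step, and a fixed number of steps can fill a sequence that left-to-right decoding would need one step per token to produce.

```python
MASK = None  # placeholder for a position that has not been resolved yet


def toy_denoiser(tokens, target):
    """Hypothetical stand-in for a learned model.

    Returns a (prediction, confidence) pair for every position at once.
    The prediction simply copies the target, and the confidence is an
    invented rule: positions with more resolved neighbours score higher.
    """
    preds = []
    for i, tok in enumerate(target):
        neighbours = [tokens[j] for j in (i - 1, i + 1) if 0 <= j < len(tokens)]
        known = sum(n is not MASK for n in neighbours)
        confidence = 0.5 + 0.2 * known - 0.01 * i
        preds.append((tok, confidence))
    return preds


def parallel_refine(target, steps):
    """Fill a fully masked sequence in `steps` passes (assumed linear schedule)."""
    n = len(target)
    tokens = [MASK] * n
    history = []
    for step in range(1, steps + 1):
        preds = toy_denoiser(tokens, target)  # all positions predicted together
        masked = [i for i in range(n) if tokens[i] is MASK]
        # Total number of positions that should be resolved after this step.
        resolved_goal = (n * step + steps - 1) // steps
        to_unmask = resolved_goal - (n - len(masked))
        # Commit the most confident masked positions; the rest stay masked.
        ranked = sorted(masked, key=lambda i: -preds[i][1])
        for i in ranked[:to_unmask]:
            tokens[i] = preds[i][0]
        history.append(list(tokens))
    return tokens, history


if __name__ == "__main__":
    # A made-up structured output: a label followed by box coordinates.
    target = ["airport", "runway", "x1", "y1", "x2", "y2"]
    final, history = parallel_refine(target, steps=3)
    for step, state in enumerate(history, start=1):
        print(step, state)
```

In this toy run, six tokens are resolved in three passes, whereas left-to-right decoding would take six. A real diffusion-based model would replace `toy_denoiser` with a trained network and would likely use a different schedule and confidence estimate.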
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Geographic Information Systems Studies
