GeoDiT: Point-Conditioned Diffusion Transformer for Satellite Image Synthesis

Srikumar Sastry; Dan Cher; Brian Wei; Aayush Dhakal; Subash Khanal; Dev Gupta; Nathan Jacobs

arXiv:2603.02172·cs.CV·March 3, 2026

TL;DR

GeoDiT is a diffusion transformer for text-to-satellite image generation. Generation is controlled by a set of points, each carrying a spatial location and its own text description, and the authors report that the model outperforms existing state-of-the-art satellite image generators.

Contribution

We propose a point-based conditioning framework with an adaptive local attention mechanism for satellite image synthesis, reducing annotation effort and improving generation quality.

Findings

01. GeoDiT surpasses state-of-the-art models in satellite image generation.

02. The point-based control provides semantically rich and flexible image synthesis.

03. Adaptive local attention effectively regularizes attention scores based on input points.

Abstract

We introduce GeoDiT, a diffusion transformer designed for text-to-satellite image generation with point-based control. Existing controlled satellite image generative models often require pixel-level maps that are time-consuming to acquire, yet semantically limited. To address this limitation, we introduce a novel point-based conditioning framework that controls the generation process through the spatial location of the points and the textual description associated with each point, providing semantically rich control signals. This approach enables flexible, annotation-friendly, and computationally simple inference for satellite image generation. To this end, we introduce an adaptive local attention mechanism that effectively regularizes the attention scores based on the input point queries. We systematically evaluate various domain-specific design choices for training GeoDiT, including…
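The abstract does not spell out how the adaptive local attention mechanism regularizes attention scores, so the following is only a minimal sketch of one plausible reading: attention scores between a point query and image patches are penalized by the squared distance from the point to each patch center before the softmax. The Gaussian-style penalty, the fixed `sigma`, the 4x4 patch grid, and the names `patch_centers`, `local_attention_weights`, and `point` are all assumptions made for illustration, not details taken from the paper.

```python
import math


def patch_centers(grid):
    """Centers of a grid x grid patch layout in normalized [0, 1] coordinates."""
    return [((col + 0.5) / grid, (row + 0.5) / grid)
            for row in range(grid) for col in range(grid)]


def local_attention_weights(scores, point_xy, centers, sigma):
    """Softmax over patch scores after subtracting a distance penalty.

    The penalty d^2 / (2 * sigma^2) is an assumed form of locality
    regularization; the paper's actual mechanism may differ.
    """
    px, py = point_xy
    biased = []
    for score, (cx, cy) in zip(scores, centers):
        d2 = (cx - px) ** 2 + (cy - py) ** 2
        biased.append(score - d2 / (2.0 * sigma ** 2))
    # Numerically stable softmax over the biased scores.
    top = max(biased)
    exps = [math.exp(b - top) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical point prompt: a location plus its text description.
point = {"xy": (0.125, 0.125), "text": "a small pond"}
centers = patch_centers(4)
scores = [0.0] * len(centers)  # uniform raw scores isolate the effect of the bias
weights = local_attention_weights(scores, point["xy"], centers, sigma=0.2)
```

With uniform raw scores, the weights concentrate on the patches nearest the point and decay with distance. A smaller `sigma` tightens that neighborhood; an adaptive variant would presumably choose the neighborhood per point rather than use one fixed value, but the abstract does not describe how.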

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques