GeoDiT: Point-Conditioned Diffusion Transformer for Satellite Image Synthesis
Srikumar Sastry, Dan Cher, Brian Wei, Aayush Dhakal, Subash Khanal, Dev Gupta, Nathan Jacobs

TL;DR
GeoDiT is a diffusion transformer that generates satellite images from point-based spatial and textual inputs, giving flexible, semantically rich control; the authors report that it outperforms existing models.
Contribution
We propose a point-based conditioning framework with an adaptive local attention mechanism for satellite image synthesis, reducing annotation effort and improving generation quality.
Findings
GeoDiT surpasses state-of-the-art models in satellite image generation.
The point-based control provides semantically rich and flexible image synthesis.
Adaptive local attention effectively regularizes attention scores based on input points.
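The idea behind the last finding can be illustrated with a small sketch. This summary does not specify how GeoDiT's adaptive local attention is formulated, so the code below is not the authors' method. It only shows one common way to bias attention scores toward a point's location: subtracting a Gaussian distance penalty from the raw scores before the softmax. The function names, the normalized (x, y) coordinate convention, and the per-point radius `sigma` are all assumptions made for illustration.

```python
import math


def patch_centers(grid_size):
    """Normalized (x, y) centers of a grid_size x grid_size patch grid, row-major."""
    step = 1.0 / grid_size
    return [((col + 0.5) * step, (row + 0.5) * step)
            for row in range(grid_size) for col in range(grid_size)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def local_attention_weights(point, logits, grid_size, sigma):
    """Attention of one point query over patch tokens, biased toward the point.

    point:  (x, y) in [0, 1] (assumed convention, not from the paper)
    logits: raw query-key scores, one per patch, row-major
    sigma:  hypothetical per-point radius; smaller values give more local attention
    """
    centers = patch_centers(grid_size)
    if len(logits) != len(centers):
        raise ValueError("expected one logit per patch")
    biased = []
    for (cx, cy), score in zip(centers, logits):
        dist_sq = (cx - point[0]) ** 2 + (cy - point[1]) ** 2
        # Gaussian penalty: distant patches get lower scores before the softmax.
        biased.append(score - dist_sq / (2.0 * sigma ** 2))
    return softmax(biased)


if __name__ == "__main__":
    grid = 4
    flat_logits = [0.0] * (grid * grid)  # uninformative scores isolate the bias
    weights = local_attention_weights((0.1, 0.1), flat_logits, grid, sigma=0.2)
    best = max(range(len(weights)), key=weights.__getitem__)
    print("most attended patch index:", best)
```

With uninformative raw scores, the weights are determined entirely by the distance penalty, so shrinking `sigma` concentrates attention on the patches nearest the point, while a large `sigma` approaches uniform attention. In a real model the radius would presumably be learned or predicted per point rather than passed in by hand.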
Abstract
We introduce GeoDiT, a diffusion transformer designed for text-to-satellite image generation with point-based control. Existing controlled satellite image generative models often require pixel-level maps that are time-consuming to acquire, yet semantically limited. To address this limitation, we introduce a novel point-based conditioning framework that controls the generation process through the spatial location of the points and the textual description associated with each point, providing semantically rich control signals. This approach enables flexible, annotation-friendly, and computationally simple inference for satellite image generation. To this end, we introduce an adaptive local attention mechanism that effectively regularizes the attention scores based on the input point queries. We systematically evaluate various domain-specific design choices for training GeoDiT, including…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
