TL;DR
SHARP is a training-free method for resolution promotion in remote sensing (RS) image synthesis. Instead of applying one static positional scaling rule, it aligns positional-encoding rescaling with the denoising process, improving realism and multi-scale generation.
Contribution
The paper introduces a domain-specialized generative prior trained on RS images and a denoising-aware positional adaptation strategy for resolution promotion during diffusion-based synthesis.
Findings
SHARP outperforms all training-free baselines on multiple metrics.
It maintains robustness across various resolutions with negligible computational overhead.
The method effectively balances layout formation and detail recovery in RS image synthesis.
Abstract
Text-to-image generation powered by Diffusion Transformers (DiTs) has made remarkable strides, yet remote sensing (RS) synthesis lags behind due to two barriers: the absence of a domain-specialized DiT prior and the prohibitive cost of training at the large resolutions that RS applications demand. Training-free resolution promotion via Rotary Position Embedding (RoPE) rescaling offers a practical remedy, but every existing method applies a static positional scaling rule throughout the denoising process. This uniform compression is particularly harmful for RS imagery, whose substantially denser medium- and high-frequency energy encodes the fine structures critical for aerial-scene realism, such as vehicles, building contours, and road markings. Addressing both challenges requires a domain-specialized generative prior coupled with a denoising-aware positional adaptation strategy. To this…
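To make the distinction in the abstract concrete, the following minimal sketch contrasts a static RoPE position rescaling with a timestep-dependent one. It is an illustration under stated assumptions, not SHARP's actual schedule: the abstract is truncated before the method is described, so the function names, the 1-D setting, and the linear blend in `scheduled_scaled_positions` are all hypothetical. The only parts taken from the text are the ideas that existing methods use one fixed scaling rule for every denoising step and that a denoising-aware alternative would vary it.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Standard 1-D RoPE rotation angles: positions times inverse frequencies."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(positions, inv_freq)  # shape: (len(positions), dim // 2)

def static_scaled_positions(target_len, train_len):
    """Static rescaling: compress target positions into the trained range,
    using the same factor at every denoising step."""
    scale = train_len / target_len
    return np.arange(target_len) * scale

def scheduled_scaled_positions(target_len, train_len, t):
    """Hypothetical denoising-aware rescaling (an assumption, not the paper's rule).

    t is the normalized noise level: 1.0 at the first step, 0.0 at the last.
    The scale blends linearly from full compression (t = 1) to none (t = 0).
    """
    static_scale = train_len / target_len
    scale = t * static_scale + (1.0 - t)
    return np.arange(target_len) * scale

# Example: a model trained on 4 positions, sampled at 8 positions.
early = rope_angles(scheduled_scaled_positions(8, 4, t=1.0), dim=8)
late = rope_angles(scheduled_scaled_positions(8, 4, t=0.0), dim=8)
```

In this sketch the early steps see positions compressed into the trained range, while the late steps see uncompressed positions. That direction is a guess motivated by the layout-versus-detail trade-off mentioned in the findings; the paper may schedule the scale differently.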
