Object Fidelity Diffusion for Remote Sensing Image Generation
Ziqi Ye, Shuran Ma, Jie Yang, Xiaoyi Yang, Yi Yang, Ziyang Gong, Xue Yang, Haipeng Wang

TL;DR
This paper introduces Object Fidelity Diffusion (OF-Diff), a novel method for generating high-quality, controllable remote sensing images by extracting object shapes, employing a dual-branch diffusion model, and fine-tuning for diversity and semantic consistency, outperforming existing methods.
Contribution
The paper presents the first approach to incorporate prior object shapes into diffusion models for remote sensing, using a dual-branch structure and a new fine-tuning method to enhance image fidelity and diversity.
Findings
OF-Diff achieves higher image quality metrics than state-of-the-art methods.
Significant improvements in detection accuracy for small and polymorphic objects.
mAP increases by 8.3%, 7.7%, and 4.0% for airplanes, ships, and vehicles, respectively.
Abstract
High-precision controllable remote sensing image generation is both meaningful and challenging. Existing diffusion models often produce low-fidelity images due to their inability to adequately capture morphological details, which may affect the robustness and reliability of object detection models. To enhance the accuracy and fidelity of generated objects in remote sensing, this paper proposes Object Fidelity Diffusion (OF-Diff), which effectively improves the fidelity of generated objects. Specifically, we are the first to extract the prior shapes of objects based on the layout for diffusion models in remote sensing. Then, we introduce a dual-branch diffusion model with diffusion consistency loss, which can generate high-fidelity remote sensing images without providing real images during the sampling phase. Furthermore, we introduce DDPO to fine-tune the diffusion process, making the…
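The dual-branch idea in the abstract can be illustrated with a small sketch. This summary does not give the exact form of the paper's diffusion consistency loss, so the code below is an assumption: both branches are trained to predict the injected noise, and a mean-squared-error term pulls the shape-only branch toward the image-plus-shape branch. The function names, the three-term sum, and the weight `lam` are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def dual_branch_loss(
    eps_true: Sequence[float],
    eps_teacher: Sequence[float],
    eps_student: Sequence[float],
    lam: float = 1.0,
) -> float:
    """Illustrative dual-branch objective (assumed form, not the paper's).

    eps_true:    the noise that was added to the clean latent
    eps_teacher: prediction of the branch that sees image and shape features
    eps_student: prediction of the branch that sees shape features only
    lam:         hypothetical weight on the consistency term
    """
    denoise_teacher = mse(eps_teacher, eps_true)   # standard denoising term
    denoise_student = mse(eps_student, eps_true)   # standard denoising term
    consistency = mse(eps_student, eps_teacher)    # student follows teacher
    return denoise_teacher + denoise_student + lam * consistency
```

Under this assumed objective, only the shape-conditioned branch would be needed at sampling time, which matches the abstract's claim that no real images are required during sampling.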
Peer Reviews
Decision·ICLR 2026 Poster
1. ESGM: Leverages pre-trained vision-language and segmentation models (RemoteCLIP and RemoteSAM) to extract precise object shape masks, providing strong geometric priors beyond simple bounding boxes.
2. Employs a teacher-student architecture where a "teacher" decoder (conditioned on both image and shape features) guides a "student" decoder (conditioned only on shape features). This allows the model to learn to generate high-fidelity textures and details without requiring real image references
1. The ESGM module is critically dependent on two large, specialized models: RemoteCLIP and RemoteSAM. While effective, this raises questions about the framework's scalability, accessibility, and potential biases inherited from these foundational models. The paper could benefit from a discussion on the computational cost of this "template extraction" phase and an analysis of how errors from ESGM might propagate through the diffusion pipeline.
2. The paper clearly defines the DDPO reward functio
1. OF-Diff does not require real-image references at inference, a significant practical improvement.
2. State-of-the-art results on both DIOR-R and DOTA datasets, with mAP improvements of up to 8.3% on airplane and 7.7% on ship categories.
3. The paper is well-structured, with clear problem motivation, method description, and experimental analysis.
1. The online distillation and DDPO fine-tuning steps are computationally expensive, but the paper does not report training time, GPU usage, or memory overhead.
2. The paper shows that adding captions improves aesthetics but hurts fidelity (Fig. 7). However, this trade-off is not deeply analyzed. A user study or perceptual evaluation would help clarify when and why to use captions.
3. The method heavily relies on ESGM-generated shape masks. While the paper mentions that distorted masks lead to p
- Authors identify a gap in the current literature: few works successfully tackle instance-level generation, given the difficulty of the task. The instance-level (layout-to-image) paradigm gives more precise control over the generations and better alignment with the ground-truth conditions.
- Authors propose a reliable pipeline for achieving high layout-fidelity generations through DDPO fine-tuning, without the need for real control images.
- Authors provide ablation studies for the design decisions.
- Exte
The authors do not provide any dataset augmentation experiment on OOD datasets. Such an experiment would show whether the model is useful beyond its training distribution, i.e. whether its generations actually help on other downstream datasets. I believe this is an important experiment that should be carried out, as it determines the overall usefulness of the generated images beyond the training distribution. I suggest the authors select some other dataset (not D
Taxonomy
Topics: Advanced Image Fusion Techniques · Image and Signal Denoising Methods
