TL;DR
TerraGen is a unified framework for generating spatially controlled remote sensing images to improve multiple vision tasks, addressing the limitations of task-specific models and incorporating geographical constraints.
Contribution
It introduces a multi-task layout-to-image generation framework with a novel spatial encoding scheme and provides the first large-scale dataset for remote sensing layout generation.
Findings
Achieves superior image quality across tasks
Enhances downstream task performance significantly
Demonstrates robust cross-task generalization
Abstract
Remote sensing vision tasks require extensive labeled data across multiple, interconnected domains. However, current generative data augmentation frameworks are task-isolated, i.e., each vision task requires training an independent generative model, and they ignore the modeling of geographical information and spatial constraints. To address these issues, we propose TerraGen, a unified layout-to-image generation framework that enables flexible, spatially controllable synthesis of remote sensing imagery for various high-level vision tasks, e.g., detection, segmentation, and extraction. Specifically, TerraGen introduces a geographic-spatial layout encoder that unifies bounding box and segmentation mask inputs, combined with a multi-scale injection scheme and mask-weighted loss to explicitly encode spatial constraints, from global structures to fine details. Also, we construct the…
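The abstract names two ideas that a small example can make concrete: putting bounding boxes and segmentation masks into one layout format, and weighting the training loss by that layout. The sketch below is illustrative only and is not the authors' implementation. The function names, the class-index map representation, and the fixed foreground weight `fg_weight` are all assumptions made for this example; the paper's actual encoder and weighting scheme are not described on this page.

```python
from typing import List, Tuple

# Hypothetical box format: (x0, y0, x1, y1, class_id), with x1 and y1 exclusive.
Box = Tuple[int, int, int, int, int]


def boxes_to_layout(boxes: List[Box], height: int, width: int) -> List[List[int]]:
    """Rasterize boxes into a class-index map, the same shape a segmentation mask has.

    0 means background; where boxes overlap, the later box overwrites the earlier one.
    """
    layout = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1, class_id in boxes:
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                layout[y][x] = class_id
    return layout


def mask_weighted_mse(
    pred: List[List[float]],
    target: List[List[float]],
    layout: List[List[int]],
    fg_weight: float = 2.0,  # assumed weighting, not taken from the paper
) -> float:
    """Weighted mean squared error: pixels inside any layout region count fg_weight times."""
    total = 0.0
    weight_sum = 0.0
    for pred_row, target_row, layout_row in zip(pred, target, layout):
        for p, t, c in zip(pred_row, target_row, layout_row):
            w = fg_weight if c != 0 else 1.0
            total += w * (p - t) ** 2
            weight_sum += w
    return total / weight_sum


# One 2x2 box of class 5 inside a 4x4 image.
layout = boxes_to_layout([(1, 1, 3, 3, 5)], height=4, width=4)
# Toy target: 1.0 inside the box, 0.0 elsewhere; the prediction is all zeros.
target = [[1.0 if c != 0 else 0.0 for c in row] for row in layout]
pred = [[0.0] * 4 for _ in range(4)]
loss = mask_weighted_mse(pred, target, layout)
```

In this toy case every error falls inside the box, so the weighted loss (8 / 20 = 0.4) is larger than the unweighted mean squared error (4 / 16 = 0.25), which is the intended effect of emphasizing layout regions.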
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. TerraGen can handle multiple remote sensing tasks (object detection, segmentation, etc.) within a single model. 2. The authors constructed a dataset of 45k images with layout annotations to train and evaluate their model. 3. Extensive experiments show that TerraGen can serve as a data augmentation engine, boosting performance on downstream tasks in both full-data and few-shot settings.
While the task of unified multi-task generation for remote sensing is valuable and the constructed dataset is a potential contribution, the paper has significant flaws that preclude its acceptance in its current form. My primary concerns are as follows: 1. The core technical components, i.e., a layout encoder, multi-scale feature injection, and a mask-weighted loss, are well-established adaptations of techniques from the natural image domain (e.g., GLIGEN, ControlNet, IP-Adapter). The paper pre
- The authors propose a unified multi-task architecture. Multiple tasks are handled mainly by converting the input conditions (bounding boxes, segmentation maps, ...) into a common format. To differentiate between tasks, the authors use a task encoder that generates task-specific embeddings.
- The authors introduce a hierarchical mechanism to inject spatial information at multiple resolutions.
- The authors carry out extensive experiments, showing how TerraGen improves generation metrics compared to other models for satellite images.
- Spelling mistake in Figure 2: "Dncoder".
- Image generation is constrained to RGB images. In remote sensing, satellite images carry additional spectral bands beyond RGB. The authors should consider satellite image generation that supports the sensor's physical spectral/channel range, which would yield more physically plausible reconstructions.
This paper presents a well-executed and timely study on a novel problem: unified multi-task layout generation for remote sensing data. The experimental validation is thorough and compelling, convincingly demonstrating the framework's state-of-the-art performance and its significant utility as a data augmentation engine across multiple tasks and data regimes.
The primary weaknesses of this paper concern the technical depth and clarity of its methodological contributions. 1. Limited Technical Innovation: While the concept of a unified multi-task framework is valuable, its core technical components, such as the layout encoder and multi-scale injection, appear to be straightforward adaptations of existing mechanisms (e.g., cross-attention, ControlNet) rather than fundamental innovations. The paper does not sufficiently justify why these specific composit
1. A layout-to-image generation framework for remote sensing imagery is proposed, 2. A remote sensing layout generation dataset is constructed.
1. This paper lacks novelty. Throughout the paper, the so-called “first unified framework” essentially combines and fine-tunes existing methods (such as GLIGEN and ControlNet) for remote sensing data, lacking fundamental innovation. Most importantly, the proposed method relies solely on attention mechanisms without introducing any explicit geographical rules or knowledge, offering no fundamental improvement over existing approaches. 2. The geographic-spatial layout encoder is essentially a conca
