SatDreamer360: Multiview-Consistent Generation of Ground-Level Scenes from Satellite Imagery
Xianghui Ze, Beiyi Zhu, Zhenbo Song, Jianfeng Lu, Yujiao Shi

TL;DR
SatDreamer360 is a novel framework that generates multiview-consistent 360-degree ground-level scenes from satellite images, enabling applications in simulation and autonomous navigation by ensuring geometric and temporal consistency.
Contribution
The paper introduces a triplane-based scene encoding and a panoramic epipolar-constrained attention mechanism, which together enforce multiview consistency when generating a sequence from a single satellite image.
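As a rough illustration of what an epipolar constraint looks like for panoramas, the sketch below traces where a ray from one panorama can land in a neighbouring panorama. It uses only generic two-view geometry and is not taken from the paper: the function names, the equirectangular convention (x right, y forward, z up) and the pose convention X_tgt = R · X_src + t are all assumptions made for this example.

```python
import math

def ray_to_pano_pixel(d, width, height):
    """Map a unit direction to equirectangular pixel coordinates (u, v).

    Assumed convention: x right, y forward, z up; the image centre looks along +y.
    """
    x, y, z = d
    lon = math.atan2(x, y)                      # longitude in (-pi, pi]
    lat = math.asin(max(-1.0, min(1.0, z)))     # latitude in [-pi/2, pi/2]
    u = (lon / (2.0 * math.pi) + 0.5) * width - 0.5
    v = (0.5 - lat / math.pi) * height - 0.5
    return u, v

def epipolar_curve(d_src, R, t, depths, width, height):
    """Trace a source ray into the target panorama at several candidate depths.

    Assumed pose convention: X_tgt = R @ X_src + t, with R a 3x3 nested list.
    Returns one (u, v) pixel per depth; together they sample the epipolar curve.
    """
    pixels = []
    for s in depths:
        # 3D point at depth s on the source ray, expressed in the target frame.
        p = [sum(R[i][k] * s * d_src[k] for k in range(3)) + t[i] for i in range(3)]
        norm = math.sqrt(sum(c * c for c in p))
        if norm < 1e-9:
            continue  # the point coincides with the target camera centre
        pixels.append(ray_to_pano_pixel([c / norm for c in p], width, height))
    return pixels
```

On a panorama the traced pixels follow an arc of a great circle rather than a straight line. A constrained attention layer could, for instance, let each query pixel attend only to keys within a few pixels of this curve; how SatDreamer360 builds its mask is not specified in this summary.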
Findings
Outperforms existing methods in satellite-to-ground alignment
Achieves superior multiview consistency in generated panoramas
Introduces the VIGOR++ dataset for evaluation
Abstract
Generating multiview-consistent ground-level scenes from satellite imagery is a challenging task with broad applications in simulation, autonomous navigation, and digital twin cities. Existing approaches primarily focus on synthesizing individual ground-view panoramas, often relying on auxiliary inputs like height maps or handcrafted projections, and struggle to produce multiview-consistent sequences. In this paper, we propose SatDreamer360, a framework that generates geometrically consistent multiview ground-level panoramas from a single satellite image, given a predefined pose trajectory. To address the large viewpoint discrepancy between ground and satellite images, we adopt a triplane representation to encode scene features and design a ray-based pixel attention mechanism that retrieves view-specific features from the triplane. To maintain multi-frame consistency, we…
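The idea of retrieving view-specific features from a triplane along panoramic rays can be sketched as follows. This is a minimal, generic illustration and not the authors' implementation: the function names, the coordinate convention (x right, y forward, z up), nearest-neighbour lookup, and summation of the three plane features are all assumptions chosen for brevity, and the attention over the sampled features is omitted.

```python
import math

def pano_pixel_to_ray(u, v, width, height):
    """Map an equirectangular pixel (u, v) to a unit ray direction.

    Assumed convention: x right, y forward, z up; the image centre looks along +y.
    """
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi   # longitude
    lat = (0.5 - (v + 0.5) / height) * math.pi        # latitude
    return (math.cos(lat) * math.sin(lon),
            math.cos(lat) * math.cos(lon),
            math.sin(lat))

def sample_plane(plane, a, b, extent):
    """Nearest-neighbour lookup of a feature vector at (a, b) in [-extent, extent]^2.

    `plane` is an n x n nested list of feature vectors; out-of-range points are clamped.
    """
    n = len(plane)
    i = min(n - 1, max(0, int((a + extent) / (2.0 * extent) * n)))
    j = min(n - 1, max(0, int((b + extent) / (2.0 * extent) * n)))
    return plane[i][j]

def triplane_features_along_ray(planes, origin, direction, depths, extent):
    """Collect one feature vector per depth sample along a ray.

    Each 3D point is projected onto the xy, xz and yz planes and the three
    feature vectors are summed (a common choice for triplanes, assumed here).
    """
    feats = []
    for s in depths:
        x, y, z = (o + s * d for o, d in zip(origin, direction))
        f_xy = sample_plane(planes["xy"], x, y, extent)
        f_xz = sample_plane(planes["xz"], x, z, extent)
        f_yz = sample_plane(planes["yz"], y, z, extent)
        feats.append([p + q + r for p, q, r in zip(f_xy, f_xz, f_yz)])
    return feats
```

In a ray-based attention layer, the per-depth features returned for a pixel would serve as the keys and values that the pixel's query attends over. A practical version would use bilinear interpolation and batched tensors instead of nested lists.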
Peer Reviews
Decision: ICLR 2026 Poster
1. Introducing image sequence generation into the satellite-to-ground view image synthesis task enables geometrically consistent sequential images, which are more suitable for downstream applications such as autonomous driving and 3D reconstruction. This approach further enhances the practical value of cross-view generation.
2. The proposed ray-based pixel attention mechanism and the epipolar-constrained attention leverage the geometric priors inherent in the imaging process, thereby improving
1. The proposed sequence generation task relies on a single satellite image and predefined trajectory inputs. However, the authors provide insufficient introduction and discussion regarding the trajectory data, including details such as the number of frames within each trajectory, the spatial intervals between frames, and the relative positioning of the trajectories within the satellite imagery.
2. Although the proposed VIGOR++ dataset expands the number of cities to enhance data diversity, it
* The paper’s objectives are novel and clearly defined.
* The proposed approach is logically sound, and the contributions are substantial, including the introduction of a new dataset for a newly defined task.
* The experiment demonstrates the effectiveness of the proposed method with state-of-the-art performance.
* Some parts of the paper are not fully explained and require further clarification from the authors (details are given in the Questions section).
* The epipolar constraint only ensures local consistency between two frames rather than global consistency, which makes the overall constraint relatively weak.
* There aren’t enough experiments to examine how each module affects the consistency metric, which I personally consider a core metric for multi-image generation. The existing ablations, for example, co
1. Satellite-to-ground image generation is a challenging and underexplored topic. It has practical value for simulation, urban modeling, and cross-view localization.
2. The combination of triplane representation, ray-guided sampling, and epipolar attention is technically sound. The pipeline effectively bridges satellite conditioning and panoramic generation.
3. The results show noticeable improvements in both image realism and geometric coherence. The qualitative examples look visually convinc
1. The core architectural ideas, namely triplane representation (EG3D), ray-based sampling (MVDream, Zero123++), and epipolar-constrained attention (EpiDiff), are largely from prior work. SatDreamer360 primarily integrates these components within a diffusion framework for a new application. While the system integration is well-executed, it lacks a fundamentally new algorithmic or theoretical contribution.
2. Ablation and analysis are not deep enough. It’s unclear how much each proposed module contributes
Taxonomy
Topics: Video Surveillance and Tracking Methods · Remote Sensing and LiDAR Applications · Automated Road and Building Extraction
