SatDreamer360: Multiview-Consistent Generation of Ground-Level Scenes from Satellite Imagery

Xianghui Ze; Beiyi Zhu; Zhenbo Song; Jianfeng Lu; Yujiao Shi

arXiv:2506.00600·cs.CV·October 14, 2025

Open Access · 3 Reviews

TL;DR

SatDreamer360 is a novel framework that generates multiview-consistent 360-degree ground-level scenes from satellite images, enabling applications in simulation and autonomous navigation by ensuring geometric and temporal consistency.

Contribution

It introduces a triplane-based scene encoding and a panoramic epipolar-constrained attention mechanism for multiview consistency from a single satellite image.
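As a rough illustration of what an epipolar constraint looks like for panoramas, the sketch below builds a mask of pixels in one equirectangular view that lie near the epipolar curve of a pixel in another view. On the unit sphere the epipolar constraint `b_b^T E b_a = 0` with `E = [t]_x R` describes a great circle rather than a straight line. Everything here is an assumption made for illustration and not the paper's implementation: the equirectangular axis convention, the pose convention `X_b = R X_a + t`, the helper names, and the tolerance `tol`. How such a mask would be used to restrict attention is not shown.

```python
import numpy as np

def pixel_to_bearing(u, v, width, height):
    """Map equirectangular pixel coordinates to unit bearing vectors.

    Assumed convention: longitude spans [-pi, pi) left to right,
    latitude spans (+pi/2, -pi/2) top to bottom, y axis points up.
    """
    lon = (u + 0.5) / width * 2.0 * np.pi - np.pi
    lat = np.pi / 2.0 - (v + 0.5) / height * np.pi
    return np.stack([np.cos(lat) * np.sin(lon),
                     np.sin(lat),
                     np.cos(lat) * np.cos(lon)], axis=-1)

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ x == np.cross(t, x)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def epipolar_mask(u_a, v_a, R, t, width, height, tol=0.02):
    """Boolean (height, width) mask of view-B pixels near the epipolar
    great circle of pixel (u_a, v_a) in view A.

    Assumed pose convention: X_b = R @ X_a + t.
    """
    b_a = pixel_to_bearing(u_a, v_a, width, height)
    normal = skew(t) @ R @ b_a            # E @ b_a: normal of the epipolar plane
    norm = np.linalg.norm(normal)
    if norm < 1e-12:                      # ray parallel to the baseline: no constraint
        return np.ones((height, width), dtype=bool)
    normal = normal / norm
    uu, vv = np.meshgrid(np.arange(width), np.arange(height))
    b_b = pixel_to_bearing(uu, vv, width, height)   # (height, width, 3)
    # |b_b . normal| is the sine of the angle between b_b and the epipolar plane.
    return np.abs(b_b @ normal) < tol
```

The mask depends only on the relative pose between two frames, so by itself it relates pairs of views rather than a whole sequence.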

Findings

01 Outperforms existing methods in satellite-to-ground alignment

02 Achieves superior multiview consistency in generated panoramas

03 Introduces VIGOR++ dataset for evaluation

Abstract

Generating multiview-consistent $360^{\circ}$ ground-level scenes from satellite imagery is a challenging task with broad applications in simulation, autonomous navigation, and digital twin cities. Existing approaches primarily focus on synthesizing individual ground-view panoramas, often relying on auxiliary inputs like height maps or handcrafted projections, and struggle to produce multiview-consistent sequences. In this paper, we propose SatDreamer360, a framework that generates geometrically consistent multi-view ground-level panoramas from a single satellite image, given a predefined pose trajectory. To address the large viewpoint discrepancy between ground and satellite images, we adopt a triplane representation to encode scene features and design a ray-based pixel attention mechanism that retrieves view-specific features from the triplane. To maintain multi-frame consistency, we…
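To make the triplane idea concrete, here is a minimal sketch of how features for points along a camera ray can be read from three axis-aligned feature planes. It assumes an EG3D-style layout (planes `xy`, `xz`, `yz`, each of shape `(C, H, W)`, scene normalized to `[-1, 1]^3`) and sums the three bilinear samples, which is a common choice but is not confirmed as the paper's design. The function names and the uniform depth sampling are hypothetical, and the attention step that would weight the per-sample features is left out.

```python
import numpy as np

def bilinear_sample(plane, x, y):
    """Bilinearly sample a (C, H, W) feature plane at normalized
    coordinates x, y in [-1, 1]; returns an (N, C) array."""
    _, H, W = plane.shape
    px = (x + 1.0) * 0.5 * (W - 1)        # continuous column index
    py = (y + 1.0) * 0.5 * (H - 1)        # continuous row index
    x0 = np.clip(np.floor(px).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(py).astype(int), 0, H - 2)
    fx = np.clip(px - x0, 0.0, 1.0)       # horizontal interpolation weight
    fy = np.clip(py - y0, 0.0, 1.0)       # vertical interpolation weight
    top = plane[:, y0, x0] * (1 - fx) + plane[:, y0, x0 + 1] * fx
    bot = plane[:, y0 + 1, x0] * (1 - fx) + plane[:, y0 + 1, x0 + 1] * fx
    return (top * (1 - fy) + bot * fy).T

def sample_triplane(planes, points):
    """Project (N, 3) points onto the three planes and sum the sampled
    features (aggregation by summation is an assumption)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    return (bilinear_sample(planes["xy"], x, y)
            + bilinear_sample(planes["xz"], x, z)
            + bilinear_sample(planes["yz"], y, z))

def ray_points(origin, direction, near, far, n_samples):
    """Uniformly spaced 3D points along a ray (hypothetical sampling scheme)."""
    depths = np.linspace(near, far, n_samples)
    return origin[None, :] + depths[:, None] * direction[None, :]

# Usage: features for 8 samples along one ray through a random triplane.
rng = np.random.default_rng(0)
planes = {k: rng.standard_normal((4, 16, 16)) for k in ("xy", "xz", "yz")}
pts = ray_points(np.zeros(3), np.array([0.0, 0.0, 1.0]), 0.1, 0.9, 8)
feats = sample_triplane(planes, pts)      # shape (8, 4)
```

A ray-based attention mechanism would then combine the per-sample features along each pixel's ray into a single view-specific feature; that step is model-specific and is not sketched here.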

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

1. Introducing image sequence generation into the satellite-to-ground view image synthesis task enables geometrically consistent sequential images, which are more suitable for downstream applications such as autonomous driving and 3D reconstruction. This approach further enhances the practical value of cross-view generation.
2. The proposed ray-based pixel attention mechanism and the epipolar-constrained attention leverage the geometric priors inherent in the imaging process, thereby improving

Weaknesses

1. The proposed sequence generation task relies on a single satellite image and predefined trajectory inputs. However, the authors provide insufficient introduction and discussion regarding the trajectory data, including details such as the number of frames within each trajectory, the spatial intervals between frames, and the relative positioning of the trajectories within the satellite imagery.
2. Although the proposed VIGOR++ dataset expands the number of cities to enhance data diversity, it

Reviewer 02 · Rating 6 · Confidence 3

Strengths

* The paper’s objectives are novel and clearly defined.
* The proposed approach is logically sound, and the contributions are substantial, including the introduction of a new dataset for a newly defined task.
* The experiment demonstrates the effectiveness of the proposed method with state-of-the-art performance.

Weaknesses

* Some parts of the paper are not fully explained and require further clarification from the authors (Details shown in the Question part).
* The epipolar constraint only ensures local consistency between two frames rather than global consistency, which makes the overall constraint relatively weak.
* There aren’t enough experiments to examine how each module affects the consistency metric, which I personally consider a core metric for multi-image generation. The existing ablations, for example, co

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. Satellite-to-ground image generation is a challenging and underexplored topic. It has practical value for simulation, urban modeling, and cross-view localization.
2. The combination of triplane representation, ray-guided sampling, and epipolar attention is technically sound. The pipeline effectively bridges satellite conditioning and panoramic generation.
3. The results show noticeable improvements in both image realism and geometric coherence. The qualitative examples look visually convinc

Weaknesses

1. The core architectural ideas, namely triplane representation (EG3D), ray-based sampling (MVDream, Zero123++), and epipolar-constrained attention (EpiDiff), are largely from prior work. SatDreamer360 primarily integrates these components within a diffusion framework for a new application. While the system integration is well-executed, it lacks a fundamentally new algorithmic or theoretical contribution.
2. Ablation and analysis are not deep enough. It’s unclear how much each proposed module contributes

Taxonomy

Topics: Video Surveillance and Tracking Methods · Remote Sensing and LiDAR Applications · Automated Road and Building Extraction