Stepper: Stepwise Immersive Scene Generation with Multiview Panoramas

Felix Wimbauer; Fabian Manhardt; Michael Oechsle; Nikolai Kalischek; Christian Rupprecht; Daniel Cremers; Federico Tombari

arXiv:2603.28980·cs.CV·May 4, 2026

TL;DR

Stepper is a new framework that generates high-quality, consistent 3D immersive scenes from text by expanding panoramas stepwise, combining diffusion models with geometry reconstruction.

Contribution

Stepper introduces a multi-view 360° diffusion model for consistent, high-resolution scene expansion and a geometry reconstruction pipeline that enforces geometric coherence. It is trained on a new large-scale, multi-view panorama dataset.

Findings

1. Achieves state-of-the-art fidelity in immersive scene generation
2. Outperforms prior methods in structural consistency
3. Enables high-resolution panoramic scene expansion

Abstract

The synthesis of immersive 3D scenes from text is rapidly maturing, driven by novel video generative models and feed-forward 3D reconstruction, with vast potential in AR/VR and world modeling. While panoramic images have proven effective for scene initialization, existing approaches suffer from a trade-off between visual fidelity and explorability: autoregressive expansion suffers from context drift, while panoramic video generation is limited to low resolution. We present Stepper, a unified framework for text-driven immersive 3D scene synthesis that circumvents these limitations via stepwise panoramic scene expansion. Stepper leverages a novel multi-view 360° diffusion model that enables consistent, high-resolution expansion, coupled with a geometry reconstruction pipeline that enforces geometric coherence. Trained on a new large-scale, multi-view panorama dataset, Stepper…
