One2Scene: Geometric Consistent Explorable 3D Scene Generation from a Single Image

Pengfei Wang; Liyi Chen; Zhiyuan Ma; Yanjun Guo; Guowen Zhang; Lei Zhang

arXiv:2602.19766·cs.CV·March 2, 2026

Open Access 3 Reviews

TL;DR

One2Scene is a novel framework that generates geometrically consistent, explorable 3D scenes from a single image by decomposing the problem into three sub-tasks, enabling immersive exploration with high fidelity.

Contribution

It introduces a three-step approach combining panorama generation, 3D scaffolding via Gaussian Splatting, and a view generator, improving stability and accuracy over existing methods.

Findings

01 Outperforms state-of-the-art in panorama depth estimation

02 Achieves more accurate 360° scene reconstruction

03 Supports stable, immersive scene exploration from a single image

Abstract

Generating explorable 3D scenes from a single image is a highly challenging problem in 3D vision. Existing methods struggle to support free exploration, often producing severe geometric distortions and noisy artifacts when the viewpoint moves far from the original perspective. We introduce One2Scene, an effective framework that decomposes this ill-posed problem into three tractable sub-tasks to enable immersive explorable scene generation. We first use a panorama generator to produce anchor views from a single input image as initialization. Then, we lift these 2D anchors into an explicit 3D geometric scaffold via a generalizable, feed-forward Gaussian Splatting network. Instead of treating the panorama as a single image for reconstruction, we project it into multiple sparse anchor views and reformulate the reconstruction task as multi-view stereo matching, which allows us to…
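The step of projecting a panorama into multiple sparse anchor views can be illustrated with a minimal sketch. It assumes an equirectangular panorama, four pinhole views at yaw angles 0°, 90°, 180°, and 270°, a 90° field of view, and nearest-neighbour sampling; none of these choices are stated in the abstract, and the function names `yaw_rotation` and `pano_to_view` are hypothetical, not the authors' code.

```python
import numpy as np

def yaw_rotation(yaw_deg):
    """Rotation about the vertical (y) axis by yaw_deg degrees."""
    t = np.deg2rad(yaw_deg)
    return np.array([[np.cos(t), 0.0, np.sin(t)],
                     [0.0, 1.0, 0.0],
                     [-np.sin(t), 0.0, np.cos(t)]])

def pano_to_view(pano, yaw_deg, fov_deg=90.0, size=64):
    """Sample one pinhole view from an equirectangular panorama.

    Assumed conventions: camera x points right, y down, z forward;
    panorama columns span longitude [-pi, pi], rows span latitude
    [-pi/2, pi/2] from top to bottom.
    """
    h, w = pano.shape[:2]
    # Focal length in pixels for the requested field of view.
    f = 0.5 * size / np.tan(np.deg2rad(fov_deg) / 2.0)
    # Pixel centres of the target view.
    u, v = np.meshgrid(np.arange(size) + 0.5, np.arange(size) + 0.5)
    # One unit ray per pixel in the camera frame.
    rays = np.stack([u - size / 2.0, v - size / 2.0, np.full_like(u, f)], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)
    # Rotate the rays into the panorama frame.
    rays = rays @ yaw_rotation(yaw_deg).T
    lon = np.arctan2(rays[..., 0], rays[..., 2])
    lat = np.arcsin(np.clip(rays[..., 1], -1.0, 1.0))
    # Convert spherical angles to panorama pixel indices.
    col = ((lon / (2 * np.pi) + 0.5) * w).astype(int) % w
    row = np.clip(((lat / np.pi + 0.5) * h).astype(int), 0, h - 1)
    return pano[row, col]

# Toy panorama whose pixel value equals its column index.
pano = np.tile(np.arange(360), (180, 1))
# Four anchor views (an assumed layout) that a multi-view network could consume.
anchors = [pano_to_view(pano, yaw) for yaw in (0, 90, 180, 270)]
```

Each anchor is an ordinary perspective image with a known relative rotation, which is the kind of input that lets a reconstruction be posed as matching across views rather than as depth estimation on a single distorted panorama.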

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

S1: The paper introduces a clever reformulation of monocular 3D scene generation by treating panoramic depth estimation as a multi-view stereo problem, which had not been done before. S2: The feed-forward 3D Gaussian Splatting scaffold runs in under half a second and delivers unprecedented geometric stability without iterative refinement. S3: The bidirectional fusion module and Dual-LoRA conditioning are elegant, practical innovations that significantly improve cross-view consistency and visua

Weaknesses

W1: The method relies on a proprietary panorama generator (Hunyuan-Pano-DiT) without ablation on alternatives, limiting reproducibility. W2: The 3D scaffold still shows minor artifacts in occluded regions under extreme rotations, suggesting room for post-processing or iterative refinement. W3: Evaluation on stylized scenes lacks quantitative metrics for artistic fidelity—CLIP-I and NIQE may not capture stylistic coherence. W4: No comparison to recent diffusion-based single-image 3D baselines

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The overall pipeline is well-structured and intuitive, decomposing the single-image 3D generation problem into panorama expansion, geometric scaffold reconstruction, and scaffold-guided view synthesis. Each stage has clear motivation and contributes coherently to the final performance. The paper is also easy to follow, with clear logic. 2. The authors conducted comprehensive experiments with both quantitative and qualitative results. The sufficient evidence through extensive metrics makes th

Weaknesses

1. Although the quantitative results are sufficient, the paper lacks explicit visualization of geometric outputs such as reconstructed point clouds or extracted scene meshes. The website offers many visualizations with continuous frames; however, since geometric consistency is one of the paper’s core claims, such geometric visualizations would provide more direct and convincing evidence of 3D structural accuracy. 2. The method has not been tested on complex, dynamic real-world environments, such as urban or o

Reviewer 03 · Rating 4 · Confidence 3

Strengths

The paper appears to have done comprehensive ablations: 1. Comparing sub-parts of their approach (the 3D generation) and the image generation part (by using others' 3D representation), finding improvements in both cases. 2. The use of a fixed 3D representation makes a lot of sense to fix geometric mistakes and ambiguity. By creating an approximation of the whole 3D scene a priori, the authors get around issues that exist in other works due to the accumulation of errors.

Weaknesses

1. Why no comparison with Cat3D? This seems like an obvious baseline which also goes from 1 view to multiple views. The paper focuses on other methods that, to my understanding, seem to be more aimed at handling a trajectory of views (e.g. SEVA / AnySplat) and so may in general be expected to handle more complex scenes. While the authors' approach is better than these in this case, they should also compare it to methods with more similar aims (e.g. Cat3D). 2. How does the model fare if you don't do

Taxonomy

Topics: Advanced Vision and Imaging · Robotics and Sensor-Based Localization · Advanced Image and Video Retrieval Techniques