One2Scene: Geometric Consistent Explorable 3D Scene Generation from a Single Image
Pengfei Wang, Liyi Chen, Zhiyuan Ma, Yanjun Guo, Guowen Zhang, Lei Zhang

TL;DR
One2Scene is a framework that generates geometrically consistent, explorable 3D scenes from a single image by decomposing the problem into three sub-tasks: panorama generation, 3D scaffold reconstruction, and scaffold-guided view synthesis.
Contribution
It introduces a three-step approach combining panorama generation, 3D scaffolding via Gaussian Splatting, and a view generator, improving stability and accuracy over existing methods.
Findings
Outperforms state-of-the-art in panorama depth estimation
Achieves more accurate 360° scene reconstruction
Supports stable, immersive scene exploration from a single image
Abstract
Generating explorable 3D scenes from a single image is a highly challenging problem in 3D vision. Existing methods struggle to support free exploration, often producing severe geometric distortions and noisy artifacts when the viewpoint moves far from the original perspective. We introduce One2Scene, an effective framework that decomposes this ill-posed problem into three tractable sub-tasks to enable immersive explorable scene generation. We first use a panorama generator to produce anchor views from a single input image as initialization. Then, we lift these 2D anchors into an explicit 3D geometric scaffold via a generalizable, feed-forward Gaussian Splatting network. Instead of treating the panorama as a single image for reconstruction, we project it into multiple sparse anchor views and reformulate the reconstruction task as multi-view stereo matching, which allows us to…
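The projection of a panorama into sparse anchor views mentioned in the abstract can be illustrated with a minimal sketch. It assumes an equirectangular panorama and pinhole anchor cameras; the function name `anchor_view_from_panorama`, the 90° field of view, the four yaw angles, and nearest-neighbour sampling are illustrative assumptions, not the configuration used in the paper.

```python
import numpy as np

def anchor_view_from_panorama(pano, yaw_deg, pitch_deg=0.0, fov_deg=90.0, size=64):
    """Sample a square pinhole view from an equirectangular panorama.

    Hypothetical helper for illustration only; nearest-neighbour lookup.
    Camera convention: +z forward, +x right, +y down. Positive pitch looks up.
    """
    H, W = pano.shape[:2]
    f = 0.5 * size / np.tan(np.radians(fov_deg) / 2.0)  # focal length in pixels
    u = np.arange(size) + 0.5 - size / 2.0               # pixel centres about the optical axis
    x, y = np.meshgrid(u, u)
    dirs = np.stack([x, y, np.full_like(x, f)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)  # unit ray per pixel

    yaw, pitch = np.radians(yaw_deg), np.radians(pitch_deg)
    cy, sy, cp, sp = np.cos(yaw), np.sin(yaw), np.cos(pitch), np.sin(pitch)
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]])  # pitch about x
    Ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])  # yaw about y
    d = dirs @ (Ry @ Rx).T                                # rotate rays into the panorama frame

    lon = np.arctan2(d[..., 0], d[..., 2])                # longitude in [-pi, pi]
    lat = np.arcsin(np.clip(d[..., 1], -1.0, 1.0))        # latitude, positive = down
    col = ((lon / (2 * np.pi) + 0.5) * W).astype(int) % W
    row = np.clip(((lat / np.pi + 0.5) * H).astype(int), 0, H - 1)
    return pano[row, col]

# Toy panorama whose pixel value equals its column index (1 column per degree).
pano = np.tile(np.arange(360), (180, 1))
# Four horizontal anchor views, 90 degrees apart (an assumed layout).
views = [anchor_view_from_panorama(pano, yaw) for yaw in (0, 90, 180, 270)]
```

Each returned view is an ordinary perspective image with a known relative rotation, which is the kind of input a multi-view stereo formulation expects; a real pipeline would likely use bilinear sampling and its own choice of view count and overlap.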
Peer Reviews
Decision: ICLR 2026 Poster
S1: The paper introduces a clever reformulation of monocular 3D scene generation by treating panoramic depth estimation as a multi-view stereo problem, which had not been done before.
S2: The feed-forward 3D Gaussian Splatting scaffold runs in under half a second and delivers unprecedented geometric stability without iterative refinement.
S3: The bidirectional fusion module and Dual-LoRA conditioning are elegant, practical innovations that significantly improve cross-view consistency and visua
W1: The method relies on a proprietary panorama generator (Hunyuan-Pano-DiT) without ablation on alternatives, limiting reproducibility.
W2: The 3D scaffold still shows minor artifacts in occluded regions under extreme rotations, suggesting room for post-processing or iterative refinement.
W3: Evaluation on stylized scenes lacks quantitative metrics for artistic fidelity; CLIP-I and NIQE may not capture stylistic coherence.
W4: No comparison to recent diffusion-based single-image 3D baselines
1. The overall pipeline is well-structured and intuitive, decomposing the single-image 3D generation problem into panorama expansion, geometric scaffold reconstruction, and scaffold-guided view synthesis. Each stage has clear motivation and contributes coherently to the final performance. The paper is also easy to follow, with clear logic. 2. The authors conducted comprehensive experiments with both quantitative and qualitative results. The sufficient evidence through extensive metrics makes th
1. Although the quantitative results are sufficient, the paper lacks explicit visualization of geometric outputs such as reconstructed point clouds or extracted scene meshes. The website contains many visualizations with continuous frames; however, because geometric consistency is one of the paper’s core claims, geometric visualizations would provide more direct and convincing evidence of 3D structural accuracy. 2. The method has not been tested on complex, dynamic real-world environments, such as urban or o
The paper appears to have done comprehensive ablations: 1. Comparing sub-parts of their approach (the 3D generation) and the image generation part (by using others' 3D representation), finding improvements in both cases. 2. The use of a fixed 3D representation makes a lot of sense to fix geometric mistakes and ambiguity. By creating an approximation of the whole 3D scene a priori, the authors get around issues that exist in other works due to the accumulation of errors.
1. Why is there no comparison with Cat3D? This seems like an obvious baseline, which also goes from one view to multiple views. The paper focuses on other methods that, to my understanding, seem to be more aimed at handling a trajectory of views (e.g. SEVA / AnySplat) and so may in general be expected to handle more complex scenes. While the authors' approach is better than these in this case, they should also compare it to methods with more similar aims (e.g. Cat3D). 2. How does the model fare if you don't do
Taxonomy
Topics: Advanced Vision and Imaging · Robotics and Sensor-Based Localization · Advanced Image and Video Retrieval Techniques
