Constructing a 3D Scene from a Single Image

Kaizhi Zheng; Ruijian Zha; Zishuo Xu; Jing Gu; Jie Yang; Xin Eric Wang

arXiv:2505.15765·cs.CV·October 7, 2025

Open Access 3 Reviews

TL;DR

SceneFuse-3D is a training-free framework that synthesizes coherent 3D scenes from a single top-down image by decomposing the scene into regions and using spatial-aware inpainting, outperforming existing methods in quality and coherence.

Contribution

We introduce SceneFuse-3D, a novel training-free approach that combines region-based generation and spatial-aware 3D inpainting for high-quality, coherent 3D scene synthesis from a single image.

Findings

01. Outperforms state-of-the-art methods in geometry quality and coherence

02. Generates high-fidelity 3D scenes without training or supervision

03. Effective across diverse scene types

Abstract

Acquiring detailed 3D scenes typically demands costly equipment, multi-view data, or labor-intensive modeling. Therefore, a lightweight alternative, generating complex 3D scenes from a single top-down image, plays an essential role in real-world applications. While recent 3D generative models have achieved remarkable results at the object level, their extension to full-scene generation often leads to inconsistent geometry, layout hallucinations, and low-quality meshes. In this work, we introduce SceneFuse-3D, a training-free framework designed to synthesize coherent 3D scenes from a single top-down view. Our method is grounded in two principles: region-based generation to improve image-to-3D alignment and resolution, and spatial-aware 3D inpainting to ensure global scene coherence and high-quality geometry generation. Specifically, we decompose the input image into overlapping regions…
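The region decomposition mentioned at the end of the abstract can be illustrated with a minimal sketch. It assumes square sliding windows over the top-down image; the function name `overlapping_regions` and the parameters `size` and `overlap` are hypothetical and are not taken from the paper, whose actual partitioning scheme is not given in the truncated abstract.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


def overlapping_regions(height: int, width: int, size: int, overlap: int) -> List[Box]:
    """Cover a height x width image with windows of side `size`.

    Neighbouring windows share at least `overlap` pixels. This is an
    illustrative tiling, not the paper's exact decomposition.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    stride = size - overlap

    def starts(extent: int) -> List[int]:
        # A single window suffices when the image is no larger than the window.
        if extent <= size:
            return [0]
        points = list(range(0, extent - size, stride))
        points.append(extent - size)  # last window sits flush with the border
        return points

    return [
        (top, left, min(top + size, height), min(left + size, width))
        for top in starts(height)
        for left in starts(width)
    ]


boxes = overlapping_regions(10, 10, size=6, overlap=2)
```

For a 10 x 10 image with `size=6` and `overlap=2`, this yields four windows, and horizontally or vertically adjacent windows share a 2-pixel band that a later stage could use to keep neighbouring regions consistent.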

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

* SceneFuse-3D employs a training-free approach that uses existing models to accomplish the scene generation task without fine-tuning the base models.
* It uses existing foundation models (e.g., depth estimation, Florence2, and SAM2) to provide spatial priors and effectively stabilize the global layout and cross-region consistency.
* The paper is well structured in general.

Weaknesses

* The method appears to rely heavily on external priors (e.g., monocular depth, Florence2, SAM2, ICP), which may propagate errors throughout the pipeline.
* Some of the generated scenes appear to contain holes (e.g., in Figure 1 and the supplementary materials).
* The proposed method seems to support only top-down views from specific angles as input images.

Reviewer 02 · Rating 8 · Confidence 4

Strengths

- The proposed method effectively modifies and aggregates existing solutions to different subproblems into a single pipeline that solves the general problem of generating a 3D scene.
- The proposed method shows better qualitative and quantitative performance than previous training-free and model-based generation results, as elaborated in multiple tables in the manuscript.
- The ablation study shows that the paper's two original contributions, i.e., region-based generation and landmark conditioning, do help the g

Weaknesses

Although I believe the current version of the manuscript is above the acceptance threshold, there are some limitations that prevent me from recommending it for a higher honor (e.g., Highlight/Oral).
1. First of all, since the paper utilizes various previously proposed methods, some modified and some unmodified, it would be much better to have a summary table (somewhere in the appendices) that shows which part of the pipeline is operated by which method, thereby implying modular upgrad

Reviewer 03 · Rating 4 · Confidence 4

Strengths

This paper provides a sound pipeline that leverages a 3D generation model, Trellis, to generate 3D scenes.
1. It adapts a 2D diffusion method to 3D generation, and develops a region-by-region structured latent generation method.
2. It presents a masked rectified flow method to retain the latent features at known voxels.
3. The experimental results verify the advantage of the proposed method.
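As an illustration of the idea the reviewer describes (retaining latent features at known voxels during rectified-flow sampling), the following is a generic, RePaint-style sketch and not the authors' implementation. It assumes the convention x_t = (1 - t) * x_0 + t * noise with sampling from t = 1 to t = 0; the function `masked_flow_sample`, the toy velocity field, and all array shapes are hypothetical.

```python
import numpy as np


def masked_flow_sample(velocity, known, mask, steps=20, seed=0):
    """Euler-integrate a rectified flow while pinning known entries.

    velocity(x, t) is a stand-in for a learned velocity model (assumption).
    `mask` is True where the latent is already known and must be retained.
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(known.shape)
    x = noise.copy()
    ts = np.linspace(1.0, 0.0, steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        # Plain Euler step from time t towards t_next (t decreases to 0).
        x = x - (t - t_next) * velocity(x, t)
        # Re-noise the known latent to the new time level and overwrite
        # the masked entries, so known voxels follow their own trajectory.
        known_t = (1.0 - t_next) * known + t_next * noise
        x = np.where(mask, known_t, x)
    return x


# Toy check: the exact velocity for a constant target is (x - target) / t.
target = np.full((4, 4, 4), 2.0)
known = np.zeros((4, 4, 4))
mask = np.zeros((4, 4, 4), dtype=bool)
mask[:2] = True
known[mask] = 5.0
out = masked_flow_sample(lambda x, t: (x - target) / t, known, mask)
```

In this toy setup the masked half of the grid ends at the known value and the unmasked half is filled in by the flow. A real model's velocity would also depend on neighbouring voxels, which is how the known content could influence the inpainted region.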

Weaknesses

1. The proposed method relies on the top-down view of a 3D scene as a condition for 3D scene generation, so the generated ground plane is generally flat. It might be difficult to handle terrains.
2. The experimental results contain 4 scenes, which is not enough to verify the stability of the proposed pipeline. In addition, how is the method influenced by different depth estimation methods? I would like to see how this pipeline works with SOTA depth estimation methods.

Taxonomy

Topics: 3D Surveying and Cultural Heritage

Methods: Inpainting