SceneMaker: Open-set 3D Scene Generation with Decoupled De-occlusion and Pose Estimation Model
Yukai Shi, Weiyu Li, Zihao Wang, Hongyang Li, Xingyu Chen, Ping Tan, Lei Zhang

TL;DR
SceneMaker introduces a decoupled framework for open-set 3D scene generation that improves geometry quality and pose accuracy under occlusion by leveraging diverse datasets and advanced attention mechanisms.
Contribution
It decouples de-occlusion from 3D object generation and proposes a unified pose estimation model with global and local attention, enhancing open-set scene generation.
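The decoupling described above can be pictured as three independent stages chained together. The following is a minimal sketch of that control flow only, not the authors' implementation: `PlacedObject`, `run_decoupled_pipeline`, and the three stage callables are hypothetical names, and the lambdas in the usage lines are toy stubs standing in for real models.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# All names here are hypothetical placeholders, not SceneMaker's actual API.


@dataclass
class PlacedObject:
    mesh: str                 # stand-in for generated 3D geometry
    pose: Tuple[float, ...]   # stand-in for a pose parameterization


def run_decoupled_pipeline(
    object_crops: Sequence[str],
    deocclude: Callable[[str], str],
    generate_3d: Callable[[str], str],
    estimate_poses: Callable[[Sequence[str], Sequence[str]], List[Tuple[float, ...]]],
) -> List[PlacedObject]:
    # Stage 1: complete each occluded object crop independently.
    completed = [deocclude(crop) for crop in object_crops]
    # Stage 2: generate geometry from the de-occluded images only,
    # so the 3D generator never has to reason about occlusion.
    meshes = [generate_3d(image) for image in completed]
    # Stage 3: estimate poses for all objects in one call,
    # given the original crops and the generated geometry.
    poses = estimate_poses(object_crops, meshes)
    return [PlacedObject(mesh, pose) for mesh, pose in zip(meshes, poses)]


# Toy usage with stub stages (assumed behavior, for illustration only).
scene = run_decoupled_pipeline(
    ["chair_crop", "table_crop"],
    deocclude=lambda crop: crop + ":deoccluded",
    generate_3d=lambda image: image + ":mesh",
    estimate_poses=lambda crops, meshes: [(float(i), 0.0, 0.0) for i in range(len(meshes))],
)
```

Passing the stages in as callables mirrors the point of the decoupling: each stage can be trained or swapped on its own data without touching the others.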
Findings
Outperforms existing methods in high-quality geometry reconstruction.
Achieves more accurate pose estimation under severe occlusion.
Demonstrates superior generalization on indoor and open-set scenes.
Abstract
We propose a decoupled 3D scene generation framework called SceneMaker in this work. Due to the lack of sufficient open-set de-occlusion and pose estimation priors, existing methods struggle to simultaneously produce high-quality geometry and accurate poses under severe occlusion and open-set settings. To address these issues, we first decouple the de-occlusion model from 3D object generation, and enhance it by leveraging image datasets and collected de-occlusion datasets for much more diverse open-set occlusion patterns. Then, we propose a unified pose estimation model that integrates global and local mechanisms for both self-attention and cross-attention to improve accuracy. Besides, we construct an open-set 3D scene dataset to further extend the generalization of the pose estimation model. Comprehensive experiments demonstrate the superiority of our decoupled framework on both indoor…
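The abstract does not spell out what the "global and local mechanisms" are. One plausible reading, assumed here purely for illustration, is that local attention restricts each token to tokens of the same object while global attention lets it attend to every token in the scene. The sketch below shows that distinction with plain attention masks; `local_mask`, `global_mask`, and `masked_softmax` are hypothetical helpers, not code from the paper.

```python
import math
from typing import List


def local_mask(object_ids: List[int]) -> List[List[bool]]:
    # Assumed "local" pattern: token i may attend to token j
    # only if both belong to the same object.
    return [[a == b for b in object_ids] for a in object_ids]


def global_mask(num_tokens: int) -> List[List[bool]]:
    # Assumed "global" pattern: every token may attend to every token.
    return [[True] * num_tokens for _ in range(num_tokens)]


def masked_softmax(scores: List[float], mask: List[bool]) -> List[float]:
    # Softmax over the allowed positions; masked positions get weight 0.
    top = max(s for s, allowed in zip(scores, mask) if allowed)
    exps = [math.exp(s - top) if allowed else 0.0 for s, allowed in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


# Five tokens: two from object 0, three from object 1, with uniform scores.
object_ids = [0, 0, 1, 1, 1]
scores = [0.0] * len(object_ids)

local_weights = masked_softmax(scores, local_mask(object_ids)[0])
global_weights = masked_softmax(scores, global_mask(len(object_ids))[0])
```

With uniform scores, the first token spreads its weight over its own object's two tokens in the local case and over all five tokens in the global case.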
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The divide-and-conquer design, which aims to maximize the utilization of existing priors, is well-motivated, and the quantitative evaluation further validates its effectiveness. 2. The specifically designed dataset for de-occlusion and pose estimation could facilitate future research in related fields. 3. The unified pose estimation network achieves strong performance, which may inspire future research on scene-level pose estimation.
1. In L85, the authors claim that SceneMaker is the first decoupled framework to divide the task into de-occlusion, 3D generation, and pose estimation, which is inaccurate. Earlier work such as Gen3DSR[1] also adopts a divide-and-conquer strategy. 2. In L154-155 and L202-204, the authors state that CAST3D[2] lacks interaction between objects, which is misleading. Section 5 of the CAST3D[2] paper explicitly describes post-processing procedures to adjust the layout under physical constraints. 3. Th
+ The idea is simple and the writing is clear. + The paper achieves good numbers compared to other scene generation methods.
- **System Paper and Novelty**: The primary concern is that the work is a systems paper that assembles existing foundation models (Grounded-SAM, DINOv2, Diffusion, etc.) into a pipeline. The core methodology or models are not novel contributions by the authors. Simply constructing a de-occlusion dataset/test-set is likely insufficient for ICLR acceptance given this lack of foundational novelty. Suggestion: The authors should focus on improving or outperforming one of the constituent foundation m
- The idea of decoupling de-occlusion, 3D object generation, and pose estimation makes sense and is executed cleanly. The modular design lets each part specialize, and the authors make reasonable choices about what priors each module should learn from. In my opinion, the most meaningful contribution here is probably the data: the 10K de-occlusion dataset and especially the 200K synthetic scenes used for pose estimation. These fill a clear gap in prior work and seem to drive much of the model’s p
- Pipeline complexity: The proposed system involves many moving parts, e.g., depth estimation, segmentation, a diffusion-based de-occlusion module, 3D object generation, and a custom module for pose estimation. While the modular design has its merits (see Strengths), coordinating all these components can make the pipeline hard to reproduce or extend. Each module requires either training or careful finetuning with custom training data, and errors in early stages (e.g., segmentation or depth estimati
1. By first generating de-occluded images and then doing 3D object generation, the quality and accuracy of individual 3D object geometry can be improved. 2. I appreciate that the authors curated datasets for both de-occlusion finetuning and pose estimation training, which also improve the performance.
1. From the teaser and the qualitative comparison, the visual quality of the proposed method improves over the baselines. However, the poses of the objects and the relations between them do not seem very faithful to the input scene image. For example, in the second image of the teaser, the orientation of the bookshelf does not seem accurate; in the fourth image, the chair overlaps with the table. Also, in Figure 7 (a) and (d), it is not very easy to tell if the proposed method is
Taxonomy
Topics: 3D Shape Modeling and Analysis · Robotics and Sensor-Based Localization · Robot Manipulation and Learning
