SceneGen: Single-Image 3D Scene Generation in One Feedforward Pass

Yanxu Meng; Haoning Wu; Ya Zhang; Weidi Xie

arXiv:2508.15769 · cs.CV · December 10, 2025


TL;DR

SceneGen is a framework that generates multiple textured 3D assets, together with their relative spatial positions, from a single scene image in one feedforward pass, with no extra optimization or asset retrieval. It targets 3D content creation for VR/AR and embodied AI.

Contribution

The paper introduces SceneGen, a feedforward model that takes a scene image and its object masks and synthesizes 3D assets with geometry and texture. Its key component is a new feature aggregation module that fuses local and global scene information.

Findings

01. Operates without extra optimization or asset retrieval.

02. Extends to multi-image input, which improves generation performance.

03. Produces high-quality 3D assets efficiently.

Abstract

3D content generation has recently attracted significant research interest, driven by its critical applications in VR/AR and embodied AI. In this work, we tackle the challenging task of synthesizing multiple 3D assets within a single scene image. Concretely, our contributions are fourfold: (i) we present SceneGen, a novel framework that takes a scene image and corresponding object masks as input, simultaneously producing multiple 3D assets with geometry and texture. Notably, SceneGen operates with no need for extra optimization or asset retrieval; (ii) we introduce a novel feature aggregation module that integrates local and global scene information from visual and geometric encoders within the feature extraction module. Coupled with a position head, this enables the generation of 3D assets and their relative spatial positions in a single feedforward pass; (iii) we demonstrate…
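The abstract describes the architecture only at a high level. The following is a minimal, hypothetical NumPy sketch of the general idea (per-asset local features fused with scene-wide context, then a head that predicts relative positions). It is not the authors' implementation. It assumes each asset is summarized by one visual-encoder vector and one geometric-encoder vector, uses self-attention across assets as the "global" path, and uses a linear position head whose translations are expressed relative to the first asset. The names `init_params`, `aggregate_features`, `position_head`, and the weight keys are all illustrative.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(rng: np.random.Generator, d_in: int, d_model: int) -> dict:
    """Random weights for the hypothetical aggregation module and position head."""
    scale = 1.0 / np.sqrt(d_in)
    return {
        "w_local": rng.normal(scale=scale, size=(d_in, d_model)),
        "w_q": rng.normal(scale=scale, size=(d_in, d_model)),
        "w_k": rng.normal(scale=scale, size=(d_in, d_model)),
        "w_v": rng.normal(scale=scale, size=(d_in, d_model)),
        "w_pos": rng.normal(scale=1.0 / np.sqrt(d_model), size=(d_model, 3)),
    }


def aggregate_features(visual: np.ndarray, geometric: np.ndarray, params: dict) -> np.ndarray:
    """Fuse each asset's local features with global scene context (assumed design).

    visual:    (N, Dv) per-asset features from a visual encoder (assumed)
    geometric: (N, Dg) per-asset features from a geometric encoder (assumed)
    Returns (N, d_model) aggregated features.
    """
    local = np.concatenate([visual, geometric], axis=-1)  # (N, Dv + Dg)
    q = local @ params["w_q"]
    k = local @ params["w_k"]
    v = local @ params["w_v"]
    # Every asset attends to every asset in the scene: the "global" path.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (N, N), rows sum to 1
    global_ctx = attn @ v
    # Residual sum keeps each asset's own (local) features next to the scene context.
    return local @ params["w_local"] + global_ctx


def position_head(features: np.ndarray, w_pos: np.ndarray) -> np.ndarray:
    """Hypothetical linear head: one 3D translation per asset, relative to asset 0."""
    positions = features @ w_pos  # (N, 3)
    return positions - positions[0]


rng = np.random.default_rng(0)
visual = rng.normal(size=(3, 8))     # 3 assets, 8-dim visual features
geometric = rng.normal(size=(3, 4))  # 4-dim geometric features
params = init_params(rng, d_in=12, d_model=16)
feats = aggregate_features(visual, geometric, params)
rel = position_head(feats, params["w_pos"])
print(feats.shape, rel.shape)  # (3, 16) (3, 3)
```

Because the global path is attention over the set of assets, reordering the input masks simply reorders the outputs, so no asset ordering is baked in except the choice of reference asset for relative positions.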


Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis · Advanced Vision and Imaging