OneWorld: Taming Scene Generation with 3D Unified Representation Autoencoder

Sensen Gao; Zhaoqing Wang; Qihang Cao; Dongdong Yu; Changhu Wang; Tongliang Liu; Mingming Gong; Jiawang Bian

arXiv:2603.16099·cs.CV·March 18, 2026

Open Access

TL;DR

OneWorld introduces a novel diffusion framework operating directly in a 3D unified representation space, leveraging a specialized autoencoder and consistency losses to generate high-quality, cross-view consistent 3D scenes.

Contribution

The paper presents the 3D Unified Representation Autoencoder (3D-URAE), a token-level Cross-View-Correspondence (CVC) consistency loss, and Manifold-Drift Forcing (MDF) for improved 3D scene generation.

Findings

01 Outperforms 2D-based methods in cross-view consistency

02 Generates high-quality 3D scenes

03 Demonstrates robustness across diverse scenes

Abstract

Existing diffusion-based 3D scene generation methods primarily operate in 2D image/video latent spaces, which makes maintaining cross-view appearance and geometric consistency inherently challenging. To bridge this gap, we present OneWorld, a framework that performs diffusion directly within a coherent 3D representation space. Central to our approach is the 3D Unified Representation Autoencoder (3D-URAE); it leverages pretrained 3D foundation models and augments their geometry-centric nature by injecting appearance and distilling semantics into a unified 3D latent space. Furthermore, we introduce token-level Cross-View-Correspondence (CVC) consistency loss to explicitly enforce structural alignment across views, and propose Manifold-Drift Forcing (MDF) to mitigate train-inference exposure bias and shape a robust 3D manifold by mixing drifted and original representations. Comprehensive…
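
The abstract names two training ingredients without giving formulas: a token-level consistency loss across views, and a mix of drifted and original representations. The sketch below is only an illustration of those two general ideas, not the paper's formulation. Every name in it (`cvc_consistency_loss`, `mix_drifted`, `matches`, `mix_prob`) is hypothetical, the squared-distance penalty and per-token random mixing are assumptions, and tokens are plain Python lists rather than real 3D latents.

```python
import random


def cvc_consistency_loss(tokens_a, tokens_b, matches):
    """Mean squared distance between matched tokens of two views.

    tokens_a, tokens_b: lists of feature vectors, one vector per token.
    matches: list of (i, j) pairs, assumed to mean that token i in view A
             and token j in view B observe the same 3D point.
    The squared-distance form is an assumption made for illustration.
    """
    if not matches:
        return 0.0
    total = 0.0
    for i, j in matches:
        total += sum((a - b) ** 2 for a, b in zip(tokens_a[i], tokens_b[j]))
    return total / len(matches)


def mix_drifted(original, drifted, mix_prob, rng):
    """Per token, keep the drifted vector with probability mix_prob.

    'drifted' stands in for representations perturbed the way they might be
    at inference time; how the drift is produced is not modeled here.
    """
    return [d if rng.random() < mix_prob else o
            for o, d in zip(original, drifted)]


if __name__ == "__main__":
    view_a = [[0.0, 0.0], [1.0, 1.0]]
    view_b = [[1.0, 1.0], [0.0, 0.0]]
    # Token 0 of view A is matched to token 1 of view B, and vice versa.
    print(cvc_consistency_loss(view_a, view_b, [(0, 1), (1, 0)]))
    print(mix_drifted(view_a, view_b, 0.5, random.Random(0)))
```

In this toy setup, correctly matched tokens carry identical features and the loss is zero, while a wrong match is penalized. Setting `mix_prob` to 0 or 1 returns the original or the drifted tokens unchanged, which makes the mixing step easy to check in isolation.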

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis · Advanced Vision and Imaging