Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation and Reconstruction

Yuanhao Cai; He Zhang; Kai Zhang; Yixun Liang; Mengwei Ren; Fujun Luan; Qing Liu; Soo Ye Kim; Jianming Zhang; Zhifei Zhang; Yuqian Zhou; Yulun Zhang; Xiaokang Yang; Zhe Lin; Alan Yuille

arXiv:2411.14384·cs.CV·October 14, 2025

TL;DR

This paper introduces DiffusionGS, a single-stage 3D diffusion model that generates view-consistent 3D Gaussian point clouds from a single view, enabling fast, scalable object generation and scene reconstruction that is robust to the direction of the input view.

Contribution

The paper presents DiffusionGS, a novel 3D diffusion model that directly outputs 3D Gaussian point clouds, improving view consistency and scalability over existing multi-view diffusion methods.

Findings

01 Improves PSNR/FID over state-of-the-art methods by 2.20 dB/23.25 on objects and 1.34 dB/19.16 on scenes.

02 Achieves over 5x faster inference.

03 Generates objects and scenes robustly from a single view.

Abstract

Existing feedforward image-to-3D methods mainly rely on 2D multi-view diffusion models that cannot guarantee 3D consistency. These methods easily collapse when changing the prompt view direction and mainly handle object-centric cases. In this paper, we propose a novel single-stage 3D diffusion model, DiffusionGS, for object generation and scene reconstruction from a single view. DiffusionGS directly outputs 3D Gaussian point clouds at each timestep to enforce view consistency and allow the model to generate robustly given prompt views of any direction, beyond object-centric inputs. In addition, to improve the capability and generality of DiffusionGS, we scale up 3D training data by developing a scene-object mixed training strategy. Experiments show that DiffusionGS yields improvements of 2.20 dB/23.25 and 1.34 dB/19.16 in PSNR/FID for objects and scenes over state-of-the-art methods,…
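
To give a sense of scale for the reported PSNR gains, the sketch below uses the standard definition PSNR = 10 log10(MAX^2 / MSE) to convert a dB gain into the corresponding reduction in mean squared error. The function names are illustrative and not taken from the paper or its code. The conversion is exact only for a single image pair, so it is at best a rough guide for a gain averaged over a test set, and FID differences cannot be converted this way.

```python
import math


def psnr(mse: float, max_val: float = 1.0) -> float:
    """PSNR in dB for a given mean squared error and peak pixel value."""
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which MSE shrinks when PSNR rises by gain_db decibels."""
    return 10.0 ** (gain_db / 10.0)


# PSNR gains reported in the abstract (objects and scenes).
object_ratio = mse_ratio_for_gain(2.20)
scene_ratio = mse_ratio_for_gain(1.34)

print(f"objects: MSE lower by a factor of {object_ratio:.2f}")
print(f"scenes:  MSE lower by a factor of {scene_ratio:.2f}")
```

Under these assumptions, a 2.20 dB gain corresponds to roughly 1.66x lower MSE and a 1.34 dB gain to roughly 1.36x lower MSE.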

Code & Models

Models

🤗 CaiYuanhao/DiffusionGS (model)

Datasets

🤗 CaiYuanhao/DiffusionGS (dataset, 4.1k downloads)
