Hunyuan3D 1.0: A Unified Framework for Text-to-3D and Image-to-3D Generation

Xianghui Yang; Huiwen Shi; Bowen Zhang; Fan Yang; Jiacheng Wang; Hongxu Zhao; Xinhai Liu; Xinzhou Wang; Qingxiang Lin; Jiaao Yu; Lifu Wang; Jing Xu; Zebin He; Zhuo Chen; Sicong Liu; Junta Wu; Yihang Lian; Shaoxiong Yang; Yuhong Liu; Yong Yang; Di Wang; Jie Jiang; Chunchao Guo

arXiv:2411.02293 · cs.CV · January 24, 2025

Open Access · 10 Models · 1 Dataset

TL;DR

Hunyuan3D 1.0 introduces a fast, two-stage unified framework for text- and image-conditioned 3D generation, significantly improving speed and quality over previous diffusion-based models.

Contribution

It presents a novel two-stage approach combining multi-view diffusion and rapid 3D reconstruction, supporting both text and image conditioning in a unified framework.

Findings

1. Generates multi-view images in ~4 seconds.
2. Reconstructs 3D assets in ~7 seconds.
3. Achieves a good balance between speed and quality.

Abstract

While 3D generative models have greatly improved artists' workflows, existing diffusion models for 3D generation suffer from slow generation and poor generalization. To address this issue, we propose a two-stage approach named Hunyuan3D 1.0, with a lite version and a standard version that both support text- and image-conditioned generation. In the first stage, we employ a multi-view diffusion model that efficiently generates multi-view RGB images in approximately 4 seconds. These multi-view images capture rich details of the 3D asset from different viewpoints, relaxing the task from single-view to multi-view reconstruction. In the second stage, we introduce a feed-forward reconstruction model that rapidly and faithfully reconstructs the 3D asset from the generated multi-view images in approximately 7 seconds. The reconstruction network learns to handle noise and inconsistency…
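The two-stage flow described in the abstract can be sketched as follows. This is only an illustrative outline, not the authors' code: `generate_multiview`, `reconstruct`, `View`, `Asset3D`, and the default of six views are hypothetical stand-ins, and the only figures taken from the abstract are the approximate 4-second and 7-second stage timings.

```python
from dataclasses import dataclass
from typing import List

# Approximate per-stage timings quoted in the abstract (seconds).
MULTIVIEW_SECONDS = 4.0
RECONSTRUCTION_SECONDS = 7.0


@dataclass
class View:
    """One generated RGB view (hypothetical container)."""
    azimuth_deg: float


@dataclass
class Asset3D:
    """Placeholder for the reconstructed 3D asset (hypothetical)."""
    num_input_views: int


def generate_multiview(condition: str, num_views: int = 6) -> List[View]:
    """Stage 1 stand-in: a multi-view diffusion model would run here.

    `num_views` is an arbitrary illustrative default, not a value
    stated in the abstract.
    """
    step = 360.0 / num_views
    return [View(azimuth_deg=i * step) for i in range(num_views)]


def reconstruct(views: List[View]) -> Asset3D:
    """Stage 2 stand-in: a feed-forward reconstruction model would run here."""
    return Asset3D(num_input_views=len(views))


def run_pipeline(condition: str) -> Asset3D:
    """Chain the two stages: condition -> multi-view images -> 3D asset."""
    views = generate_multiview(condition)
    return reconstruct(views)


def estimated_total_seconds() -> float:
    """Sum of the two approximate stage timings from the abstract."""
    return MULTIVIEW_SECONDS + RECONSTRUCTION_SECONDS
```

Under these assumptions, `estimated_total_seconds()` simply adds the two quoted stage timings, giving a rough end-to-end budget of about 11 seconds.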

Code & Models

Models

Datasets

tencent/HY3D-Bench
dataset · 67k downloads

Taxonomy

Topics: Image Processing and 3D Reconstruction

Methods: Diffusion · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings