Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D Assets

Weiyu Li; Xuanyang Zhang; Zheng Sun; Di Qi; Hao Li; Wei Cheng; Weiwei Cai; Shihao Wu; Jiarui Liu; Zihao Wang; Xiao Chen; Feipeng Tian; Jianxiong Pan; Zeming Li; Gang Yu; Xiangyu Zhang; Daxin Jiang; Ping Tan

arXiv:2505.07747·cs.CV·May 13, 2025

TL;DR

Step1X-3D introduces an open framework for high-fidelity, controllable textured 3D asset generation, combining a large curated dataset, a hybrid architecture, and open-source tools to advance 3D generative AI.

Contribution

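The paper presents an end-to-end pipeline with three parts: a dataset of 2M high-quality assets curated from more than 5M, a two-stage 3D-native architecture that pairs a hybrid VAE-DiT geometry generator with a diffusion-based texture synthesis module, and an open-source release of models, training code, and adaptation modules.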

Findings

01. Achieves state-of-the-art performance on 3D generation benchmarks.

02. Supports direct transfer of 2D control techniques to 3D synthesis.

03. Demonstrates high-quality, controllable textured 3D asset generation.

Abstract

While generative artificial intelligence has advanced significantly across text, image, audio, and video domains, 3D generation remains comparatively underdeveloped due to fundamental challenges such as data scarcity, algorithmic limitations, and ecosystem fragmentation. To this end, we present Step1X-3D, an open framework addressing these challenges through: (1) a rigorous data curation pipeline processing >5M assets to create a 2M high-quality dataset with standardized geometric and textural properties; (2) a two-stage 3D-native architecture combining a hybrid VAE-DiT geometry generator with a diffusion-based texture synthesis module; and (3) the full open-source release of models, training code, and adaptation modules. For geometry generation, the hybrid VAE-DiT component produces TSDF representations by employing perceiver-based latent encoding with sharp edge sampling for detail…
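To make the TSDF (truncated signed distance function) representation mentioned in the abstract concrete, here is a minimal sketch in pure Python. It is not the Step1X-3D implementation: the sphere shape, the function names (`sphere_sdf`, `truncate`, `tsdf_grid`), the truncation distance, the grid resolution, and the sign convention (negative inside, positive outside) are all assumptions chosen for illustration.

```python
import math


def sphere_sdf(point, center=(0.0, 0.0, 0.0), radius=0.5):
    """Signed distance to a sphere: negative inside, positive outside (assumed convention)."""
    return math.dist(point, center) - radius


def truncate(distance, trunc=0.1):
    """Clamp a signed distance to [-trunc, trunc], then scale it to [-1, 1]."""
    return max(-trunc, min(trunc, distance)) / trunc


def tsdf_grid(resolution=8, extent=1.0, trunc=0.1):
    """Sample the truncated SDF on a resolution^3 grid covering [-extent, extent]^3."""
    step = 2.0 * extent / (resolution - 1)
    coords = [-extent + i * step for i in range(resolution)]
    return [
        [[truncate(sphere_sdf((x, y, z)), trunc) for z in coords] for y in coords]
        for x in coords
    ]


if __name__ == "__main__":
    grid = tsdf_grid()
    # The grid corner lies far outside the sphere, so its value saturates.
    print(grid[0][0][0])
```

Truncation keeps only a narrow band of distance values around the surface and saturates everything else, so the values concentrate on the geometry near the surface rather than on empty space far from it.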

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis · Music Technology and Sound Studies