DiffTF++: 3D-aware Diffusion Transformer for Large-Vocabulary 3D Generation
Ziang Cao, Fangzhou Hong, Tong Wu, Liang Pan, Ziwei Liu

TL;DR
DiffTF++ is a novel 3D-aware diffusion transformer framework that efficiently generates diverse, high-quality 3D assets across many categories by integrating multi-view reconstruction, triplane refinement, and 3D-aware transformers.
Contribution
The paper introduces DiffTF++, a unified feed-forward model that combines an improved triplane representation, 3D-aware transformers, and a multi-view loss for large-vocabulary 3D generation.
Findings
Achieves state-of-the-art 3D generation quality and diversity.
Effectively handles large-scale, multi-category 3D asset synthesis.
Outperforms existing methods on ShapeNet and OmniObject3D datasets.
Abstract
Generating diverse and high-quality 3D assets automatically poses a fundamental yet challenging task in 3D computer vision. Despite extensive efforts in 3D generation, existing optimization-based approaches struggle to produce large-scale 3D assets efficiently. Meanwhile, feed-forward methods often focus on generating only a single category or a few categories, limiting their generalizability. Therefore, we introduce a diffusion-based feed-forward framework to address these challenges with a single model. To handle the large diversity and complexity in geometry and texture across categories efficiently, we 1) adopt an improved triplane representation to guarantee efficiency; 2) introduce a 3D-aware transformer to aggregate generalized 3D knowledge with specialized 3D features; and 3) devise a 3D-aware encoder/decoder to enhance the generalized 3D knowledge. Building upon our 3D-aware Diffusion…
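For readers unfamiliar with the triplane representation the abstract refers to, the sketch below shows how a generic triplane is usually queried: a 3D point is projected onto three axis-aligned feature planes, each plane is sampled bilinearly, and the three samples are aggregated. This is a minimal illustration and not the authors' implementation. The function names (`bilinear`, `query_triplane`, `constant_plane`), the plane keys, the [-1, 1] coordinate convention, and the choice of summation over concatenation are all assumptions made for the example.

```python
import math


def bilinear(plane, u, v):
    """Bilinearly sample plane[row][col] (a list of C floats) at (u, v) in [-1, 1]."""
    res = len(plane)
    # Map [-1, 1] to continuous pixel coordinates in [0, res - 1] and clamp.
    px = min(max((u + 1.0) * 0.5 * (res - 1), 0.0), res - 1.0)
    py = min(max((v + 1.0) * 0.5 * (res - 1), 0.0), res - 1.0)
    x0, y0 = int(math.floor(px)), int(math.floor(py))
    x1, y1 = min(x0 + 1, res - 1), min(y0 + 1, res - 1)
    wx, wy = px - x0, py - y0
    out = []
    for c in range(len(plane[0][0])):
        top = plane[y0][x0][c] * (1 - wx) + plane[y0][x1][c] * wx
        bottom = plane[y1][x0][c] * (1 - wx) + plane[y1][x1][c] * wx
        out.append(top * (1 - wy) + bottom * wy)
    return out


def query_triplane(planes, point):
    """Project a 3D point onto the XY, XZ, and YZ planes and sum the samples.

    Summation is an assumption for this sketch; concatenation is another
    common way to aggregate the three plane features.
    """
    x, y, z = point
    f_xy = bilinear(planes["xy"], x, y)
    f_xz = bilinear(planes["xz"], x, z)
    f_yz = bilinear(planes["yz"], y, z)
    return [a + b + c for a, b, c in zip(f_xy, f_xz, f_yz)]


def constant_plane(res, channels, value):
    """Hypothetical helper: a res x res plane whose features all equal `value`."""
    return [[[value] * channels for _ in range(res)] for _ in range(res)]


# Toy planes with 2 feature channels at 4 x 4 resolution.
planes = {
    "xy": constant_plane(4, 2, 1.0),
    "xz": constant_plane(4, 2, 2.0),
    "yz": constant_plane(4, 2, 3.0),
}
feature = query_triplane(planes, (0.1, -0.3, 0.7))
```

The efficiency argument for triplanes comes from storage: three planes of resolution R hold on the order of 3·R² feature vectors, whereas a dense voxel grid at the same resolution holds R³. In a real pipeline the sampled feature would typically be passed to a small decoder network to predict density and color, and the sampling itself would be done on a GPU in batches.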
Taxonomy
Topics: Natural Language Processing Techniques · Handwritten Text Recognition Techniques · Human Motion and Animation
Methods: Focus · Diffusion
