Collaborative Multi-Modal Coding for High-Quality 3D Generation
Ziang Cao, Zhaoxi Chen, Liang Pan, Ziwei Liu

TL;DR
TriMM is a 3D generative model that integrates multiple data modalities, such as RGB images and point clouds, to produce high-quality 3D assets from limited training data.
Contribution
The paper introduces TriMM, the first feed-forward 3D-native model that collaboratively encodes multi-modal data and employs a triplane diffusion approach for superior 3D generation.
Findings
Achieves competitive 3D generation quality with limited data.
Successfully incorporates diverse multi-modal datasets.
Demonstrates robustness across multiple datasets.
Abstract
3D content inherently encompasses multi-modal characteristics and can be projected into different modalities (e.g., RGB images, RGBD, and point clouds). Each modality exhibits distinct advantages in 3D asset modeling: RGB images contain vivid 3D textures, whereas point clouds define fine-grained 3D geometries. However, most existing 3D-native generative architectures either operate predominantly within single-modality paradigms, thus overlooking the complementary benefits of multi-modality data, or restrict themselves to 3D structures, thereby limiting the scope of available training datasets. To holistically harness multi-modalities for 3D modeling, we present TriMM, the first feed-forward 3D-native generative model that learns from basic multi-modalities (e.g., RGB, RGBD, and point cloud). Specifically, 1) TriMM first introduces collaborative multi-modal coding, which integrates…
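The abstract's opening claim, that one piece of 3D content can be projected into several modalities, can be illustrated with a small sketch. It is not TriMM's pipeline: it assumes a simple pinhole camera looking down the +z axis, and the function name `project_to_rgbd` and its parameters are hypothetical. Given a colored point cloud, it produces an RGB image and a depth map (together, an RGBD view) using a z-buffer.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in camera coordinates
Color = Tuple[int, int, int]        # (r, g, b), each 0-255


def project_to_rgbd(
    points: List[Point],
    colors: List[Color],
    width: int,
    height: int,
    focal: float,
) -> Tuple[List[List[Color]], List[List[Optional[float]]]]:
    """Project a colored point cloud to an RGB image and a depth map.

    Hypothetical helper for illustration only: a pinhole camera at the
    origin, principal point at the image centre, nearest point wins.
    """
    rgb = [[(0, 0, 0) for _ in range(width)] for _ in range(height)]
    depth: List[List[Optional[float]]] = [
        [None for _ in range(width)] for _ in range(height)
    ]
    cx, cy = width / 2.0, height / 2.0

    for (x, y, z), color in zip(points, colors):
        if z <= 0:
            continue  # point is behind the camera
        # Perspective divide, then shift to pixel coordinates.
        u = math.floor(focal * x / z + cx)
        v = math.floor(focal * y / z + cy)
        if 0 <= u < width and 0 <= v < height:
            # Z-buffer test: keep only the closest point per pixel.
            if depth[v][u] is None or z < depth[v][u]:
                depth[v][u] = z
                rgb[v][u] = color
    return rgb, depth


if __name__ == "__main__":
    pts = [(0.0, 0.0, 2.0), (0.0, 0.0, 1.0)]
    cols = [(255, 0, 0), (0, 255, 0)]
    image, depth_map = project_to_rgbd(pts, cols, width=4, height=4, focal=1.0)
    print(image[2][2], depth_map[2][2])
```

The same point cloud thus yields texture information (the RGB image) and geometry information (the depth map), which are the complementary signals the abstract describes.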
Taxonomy
Topics: 3D Shape Modeling and Analysis · Generative Adversarial Networks and Image Synthesis · Computer Graphics and Visualization Techniques
