HOTS3D: Hyper-Spherical Optimal Transport for Semantic Alignment of Text-to-3D Generation
Zezeng Li, Weimin Wang, Yuming Zhao, Wenhai Li, Na Lei, and Xianfeng Gu

TL;DR
HOTS3D introduces a novel spherical optimal transport approach to improve semantic alignment between text and 3D shapes in CLIP-guided generation, achieving more faithful and semantically consistent 3D outputs.
Contribution
This paper presents the first application of hyper-spherical optimal transport for aligning text and image features in 3D generation, addressing high-dimensional challenges with a new mathematical formulation and neural network implementation.
Findings
Outperforms state-of-the-art methods in text-to-3D generation
Achieves higher semantic consistency in generated shapes
Demonstrates effectiveness through extensive experiments
Abstract
Recent CLIP-guided 3D generation methods have achieved promising results but struggle to generate faithful 3D shapes that conform to the input text, due to the gap between text and image embeddings. To this end, this paper proposes HOTS3D, which makes the first attempt to effectively bridge this gap by aligning text features to the image features with spherical optimal transport (SOT). However, in high-dimensional situations, solving the SOT remains a challenge. To obtain the SOT map for high-dimensional features obtained from CLIP encoding of two modalities, we mathematically formulate and derive the solution based on Villani's theorem, which can directly align two hyper-sphere distributions without manifold exponential maps. Furthermore, we implement it by leveraging input convex neural networks (ICNNs) for the optimal Kantorovich potential. With the optimally mapped features, a…
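The abstract mentions input convex neural networks (ICNNs) as the parametrization of the Kantorovich potential. As a rough illustration of what "input convex" means, the following is a minimal pure-Python sketch, not the authors' implementation: the class name `TinyICNN`, the two-layer layout, the layer sizes, and the random weights are all hypothetical choices made for this example. It only shows the structural constraint (non-negative weights on the hidden path, a convex non-decreasing activation) that makes the output convex in the input, evaluated on unit-normalized vectors standing in for CLIP features.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Project a non-zero vector onto the unit hypersphere."""
    n = math.sqrt(dot(v, v))
    return [c / n for c in v]


def softplus(t):
    """Numerically stable softplus; convex and non-decreasing."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


class TinyICNN:
    """Hypothetical two-layer ICNN: the scalar output is convex in x.

    Convexity follows from (1) a convex, non-decreasing activation and
    (2) non-negative weights on every path that passes through a hidden layer.
    """

    def __init__(self, dim, hidden, seed=0):
        rng = random.Random(seed)

        def gauss_matrix(rows, cols, nonneg=False):
            m = [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]
            return [[abs(c) for c in r] for r in m] if nonneg else m

        self.Wx0 = gauss_matrix(hidden, dim)                 # unconstrained
        self.b0 = [rng.gauss(0, 1) for _ in range(hidden)]
        self.Wz = gauss_matrix(hidden, hidden, nonneg=True)  # must be >= 0
        self.Wx1 = gauss_matrix(hidden, dim)                 # skip connection
        self.b1 = [rng.gauss(0, 1) for _ in range(hidden)]
        self.w_out = [abs(rng.gauss(0, 1)) for _ in range(hidden)]  # >= 0

    def __call__(self, x):
        # First layer: softplus of an affine map, convex in x.
        z0 = [softplus(dot(w, x) + b) for w, b in zip(self.Wx0, self.b0)]
        # Second layer: non-negative mix of convex units plus an affine skip term.
        z1 = [
            softplus(dot(wz, z0) + dot(wx, x) + b)
            for wz, wx, b in zip(self.Wz, self.Wx1, self.b1)
        ]
        # Non-negative output weights preserve convexity.
        return dot(self.w_out, z1)


if __name__ == "__main__":
    f = TinyICNN(dim=4, hidden=8, seed=1)
    rng = random.Random(2)
    x = normalize([rng.gauss(0, 1) for _ in range(4)])
    y = normalize([rng.gauss(0, 1) for _ in range(4)])
    mid = [(a + b) / 2 for a, b in zip(x, y)]
    # Midpoint convexity: f(mid) should not exceed the average of f(x) and f(y).
    print(f(mid) <= 0.5 * (f(x) + f(y)))
```

In practice such a network would be trained, and how its gradient yields the transport map on the hypersphere is specific to the paper's derivation, which this sketch does not attempt to reproduce.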
Taxonomy
Topics: Image Processing and 3D Reconstruction · 3D Shape Modeling and Analysis · Human Motion and Animation
Methods: Contrastive Language-Image Pre-training · ALIGN
