HOTS3D: Hyper-Spherical Optimal Transport for Semantic Alignment of Text-to-3D Generation

Zezeng Li; Weimin Wang; Yuming Zhao; Wenhai Li; Na Lei; and Xianfeng Gu

arXiv:2407.14419·cs.CV·July 8, 2025

TL;DR

HOTS3D introduces a novel spherical optimal transport approach to improve semantic alignment between text and 3D shapes in CLIP-guided generation, achieving more faithful and semantically consistent 3D outputs.

Contribution

This paper presents the first application of hyper-spherical optimal transport for aligning text and image features in 3D generation, addressing high-dimensional challenges with a new mathematical formulation and neural network implementation.

Findings

01 Outperforms state-of-the-art methods in text-to-3D generation

02 Achieves higher semantic consistency in generated shapes

03 Demonstrates effectiveness through extensive experiments

Abstract

Recent CLIP-guided 3D generation methods have achieved promising results but struggle to generate faithful 3D shapes that conform to the input text, owing to the gap between text and image embeddings. To this end, this paper proposes HOTS3D, which makes the first attempt to effectively bridge this gap by aligning text features to image features with spherical optimal transport (SOT). However, solving the SOT problem in high dimensions remains a challenge. To obtain the SOT map for the high-dimensional features produced by CLIP encoding of the two modalities, we mathematically formulate and derive the solution based on Villani's theorem, which can directly align two hyper-sphere distributions without manifold exponential maps. Furthermore, we implement it by leveraging input convex neural networks (ICNNs) for the optimal Kantorovich potential. With the optimally mapped features, a…
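The abstract mentions input convex neural networks (ICNNs) as the parameterization of the Kantorovich potential. The following is a minimal sketch of what an ICNN looks like in general, not the authors' implementation: the class name `TinyICNN`, the layer sizes, the softplus activation, and the random initialization are all assumptions made for illustration. The defining constraint is that weights applied to hidden activations are non-negative and the activation is convex and non-decreasing, which makes the output convex in the input.

```python
import math
import random


def softplus(z):
    # Convex, non-decreasing activation; overflow-safe form of log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def matvec(M, v):
    # Plain matrix-vector product on nested lists.
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def rand_matrix(rng, rows, cols, nonneg=False):
    # Gaussian entries; absolute values are taken when non-negativity is required.
    M = [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]
    return [[abs(m) for m in row] for row in M] if nonneg else M


def normalize(v):
    # Project a vector onto the unit hyper-sphere (L2 normalization).
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


class TinyICNN:
    """Hypothetical two-hidden-layer ICNN (illustrative, not from the paper)."""

    def __init__(self, dim, hidden=8, seed=0):
        rng = random.Random(seed)
        # Skip connections from the raw input may have any sign.
        self.A0 = rand_matrix(rng, hidden, dim)
        self.A1 = rand_matrix(rng, hidden, dim)
        self.a2 = rand_matrix(rng, 1, dim)[0]
        # Weights on hidden activations must be non-negative to keep convexity.
        self.W1 = rand_matrix(rng, hidden, hidden, nonneg=True)
        self.w2 = rand_matrix(rng, 1, hidden, nonneg=True)[0]

    def __call__(self, x):
        z1 = [softplus(s) for s in matvec(self.A0, x)]
        pre = [p + q for p, q in zip(matvec(self.W1, z1), matvec(self.A1, x))]
        z2 = [softplus(s) for s in pre]
        hidden_term = sum(w * z for w, z in zip(self.w2, z2))
        linear_term = sum(a * xi for a, xi in zip(self.a2, x))
        return hidden_term + linear_term


# Usage: evaluate the potential at two unit-norm points and at their midpoint.
rng = random.Random(1)
potential = TinyICNN(dim=4)
x = normalize([rng.gauss(0.0, 1.0) for _ in range(4)])
y = normalize([rng.gauss(0.0, 1.0) for _ in range(4)])
midpoint = [0.5 * (a + b) for a, b in zip(x, y)]
is_convex_here = potential(midpoint) <= 0.5 * (potential(x) + potential(y))
```

The midpoint check at the end only illustrates convexity in the ambient Euclidean space; how the trained potential is turned into a transport map on the hyper-sphere is specific to the paper and is not reproduced here.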

Taxonomy

Topics: Image Processing and 3D Reconstruction · 3D Shape Modeling and Analysis · Human Motion and Animation

Methods: Contrastive Language-Image Pre-training · ALIGN