Text-Image Conditioned 3D Generation

Jiazhong Cen; Jiemin Fang; Sikuang Li; Guanjun Wu; Chen Yang; Taoran Yi; Zanwei Zhou; Zhikuan Bao; Lingxi Xie; Wei Shen; Qi Tian

arXiv:2603.21295 · cs.CV · March 24, 2026

TL;DR

This paper introduces an approach to 3D content generation that conditions on both text and image inputs, exploiting their complementary strengths for more flexible and detailed 3D asset creation.

Contribution

The paper formalizes the task of text-image conditioned 3D generation and proposes TIGON, a dual-branch model that effectively fuses visual and textual information for improved 3D synthesis.

Findings

01 Even simple late fusion of text- and image-conditioned predictions outperforms single-modality models.

02 Text-image conditioning enhances 3D generation quality.

03 TIGON demonstrates consistent improvements across experiments.

Abstract

High-quality 3D assets are essential for VR/AR, industrial design, and entertainment, motivating growing interest in generative models that create 3D content from user prompts. Most existing 3D generators, however, rely on a single conditioning modality: image-conditioned models achieve high visual fidelity by exploiting pixel-aligned cues but suffer from viewpoint bias when the input view is limited or ambiguous, while text-conditioned models provide broad semantic guidance yet lack low-level visual detail. This limits how users can express intent and raises a natural question: can these two modalities be combined for more flexible and faithful 3D generation? Our diagnostic study shows that even simple late fusion of text- and image-conditioned predictions outperforms single-modality models, revealing strong cross-modal complementarity. We therefore formalize Text-Image Conditioned 3D…
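The "simple late fusion" mentioned in the abstract can be pictured with a minimal sketch. This is only an illustration under stated assumptions, not the paper's diagnostic procedure: it assumes each single-modality model emits a prediction of the same shape, represented here as a flat list of floats, and that fusion is a convex combination. The function name `late_fusion` and the parameter `image_weight` are hypothetical.

```python
from typing import List, Sequence


def late_fusion(
    pred_text: Sequence[float],
    pred_image: Sequence[float],
    image_weight: float = 0.5,
) -> List[float]:
    """Blend two same-shaped predictions with a convex combination.

    Assumption: both inputs are outputs of independently conditioned
    models (one on text, one on an image). The source abstract does not
    specify how its late fusion is computed; this is a generic form.
    """
    if len(pred_text) != len(pred_image):
        raise ValueError("predictions must have the same length")
    if not 0.0 <= image_weight <= 1.0:
        raise ValueError("image_weight must lie in [0, 1]")
    text_weight = 1.0 - image_weight
    # Element-wise weighted average of the two predictions.
    return [
        text_weight * t + image_weight * i
        for t, i in zip(pred_text, pred_image)
    ]


if __name__ == "__main__":
    fused = late_fusion([0.0, 2.0], [1.0, 4.0], image_weight=0.5)
    print(fused)
```

Setting `image_weight` to 0 or 1 recovers the text-only or image-only prediction, which makes the single-modality baselines special cases of the fused output.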

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face Recognition and Analysis