TIGaussian: Disentangle Gaussians for Spatial-Awared Text-Image-3D Alignment

Jiarun Liu; Qifeng Chen; Yiru Zhao; Minghua Liu; Baorui Ma; Sheng Yang

arXiv:2601.19247·cs.CV·January 28, 2026

Open Access · 3 Reviews

TL;DR

TIGaussian introduces a novel framework leveraging 3D Gaussian Splatting and multi-branch tokenization to improve cross-modal alignment between 3D data, images, and text, achieving state-of-the-art results in 3D-related tasks.

Contribution

The paper proposes a multi-branch 3DGS tokenizer and bidirectional cross-modal alignment strategies to enhance 3D feature extraction and modality bridging in visual-language models.
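As a rough illustration of what separating per-Gaussian properties into branches can look like, the sketch below splits a flat Gaussian record into named attribute groups. The 14-value layout (position, scale, rotation quaternion, opacity, RGB color) and the names `LAYOUT` and `split_attributes` are assumptions chosen for this example; the paper's actual branch definitions and encoders are not given in this summary.

```python
from typing import Dict, List

# Assumed per-Gaussian layout (hypothetical, not taken from the paper):
# xyz position (3), scale (3), rotation quaternion (4), opacity (1), RGB color (3).
LAYOUT = {"position": 3, "scale": 3, "rotation": 4, "opacity": 1, "color": 3}


def split_attributes(gaussians: List[List[float]]) -> Dict[str, List[List[float]]]:
    """Split each flat Gaussian record into one list per attribute group."""
    width = sum(LAYOUT.values())
    branches: Dict[str, List[List[float]]] = {name: [] for name in LAYOUT}
    for g in gaussians:
        if len(g) != width:
            raise ValueError(f"expected {width} values per Gaussian, got {len(g)}")
        offset = 0
        for name, size in LAYOUT.items():
            # Each branch only ever sees its own slice of the record.
            branches[name].append(g[offset:offset + size])
            offset += size
    return branches
```

In a real tokenizer each group would presumably feed its own encoder before the results are combined; that part is left out because the summary does not describe it.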

Findings

01

Achieves state-of-the-art performance on multiple 3D-related tasks.

02

Effectively resolves perspective ambiguity using diffusion priors.

03

Improves 3D and cross-modal feature alignment accuracy.

Abstract

While visual-language models have profoundly linked features between texts and images, the incorporation of 3D modality data, such as point clouds and 3D Gaussians, further enables pretraining for 3D-related tasks, e.g., cross-modal retrieval, zero-shot classification, and scene recognition. As challenges remain in extracting 3D modal features and bridging the gap between different modalities, we propose TIGaussian, a framework that harnesses 3D Gaussian Splatting (3DGS) characteristics to strengthen cross-modality alignment through a multi-branch 3DGS tokenizer and modality-specific 3D feature alignment strategies. Specifically, our multi-branch 3DGS tokenizer decouples the intrinsic properties of 3DGS structures into compact latent representations, enabling more generalizable feature extraction. To further bridge the modality gap, we develop a bidirectional cross-modal alignment…
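For readers unfamiliar with contrastive alignment between modalities, the following is a minimal sketch of a generic CLIP-style symmetric loss between a batch of 3D embeddings and the paired embeddings of another modality (text or image). It is not the paper's alignment strategy, which is cut off in the abstract above. The function names, the temperature of 0.07, and the assumption of non-zero embedding vectors are all choices made for this example.

```python
import math
from typing import List


def l2_normalize(v: List[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def diagonal_cross_entropy(logits: List[List[float]]) -> float:
    """Mean of -log softmax(row i)[i], treating index i as the correct match."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(
    emb_3d: List[List[float]],
    emb_other: List[List[float]],
    temperature: float = 0.07,  # assumed value, not from the paper
) -> float:
    """Average the 3D-to-other and other-to-3D contrastive losses."""
    a = [l2_normalize(v) for v in emb_3d]
    b = [l2_normalize(v) for v in emb_other]
    # Cosine similarity of every 3D embedding with every paired-modality embedding.
    logits = [[sum(x * y for x, y in zip(u, w)) / temperature for w in b] for u in a]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))
```

The loss is small when the i-th 3D embedding is most similar to the i-th embedding of the other modality and large when the pairs are shuffled, which is the property a retrieval or zero-shot classification setup relies on.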

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The paper is well-structured and easy to follow. It focuses on improving individual design choices in current 3DGS-based text-image-3D contrastive methods and shows clear improvement.
2. The multi-view image fusion idea makes intuitive sense to me. Forcing the alignment of a 3D object and a single-view 2D image is clearly suboptimal.

Weaknesses

1. The benchmark comparison is insufficient. Only three baseline methods are listed in Table 1/2. Strong competitors like ULIP [1], OpenShape [2], and MixCon3D [3] should also be included. For example, OpenShape and MixCon3D score top-1 accuracies of 46.8 and 52.5 on Objaverse-LVIS, which is significantly better than the 41.76 top-1 accuracy of TIGaussian and calls the effectiveness of the proposed approach into question.
2. Similarly, results on other important benchmarks like ModelNet40 and ScanObj

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The TIGaussian framework proposed in the article is highly targeted. It tackles the fundamental issues of entangled attribute encoding and single-view bias in 3D multi-modal alignment tasks by introducing dedicated modules such as the multi-branch 3DGS tokenizer. It offers a fresh approach to text-image-3DGS multi-modal alignment, overcoming the drawbacks of current techniques.
2. The authors rigorously validate the effectiveness of the method and conduct verification on three major dat

Weaknesses

1. The article omits information about experimental parameters and metrics: choices such as the number of generated views (6) and the parameter sensitivity of the 3D-text projection module have not been validated by experiments. It also does not report deployment-relevant information (such as the inference latency on an A100 GPU).
2. It is unclear if baseline models (such as UniGS and Duoduo-CLIP) have implemented the same preparation procedures as TIGaussian (lik

Reviewer 03 · Rating 6 · Confidence 5

Strengths

- The paper achieves impressive improvement across various tasks.
- The writing is easy to understand and makes sense.

Weaknesses

- Lack of comparisons. One of the core contributions is 3D-aware image feature fusion. However, the use of multi-view rendered images has already been proposed in JM3D [1], which weakens the contribution. The authors should either discuss the differences or include a comparison.
- Lack of perception experiments. Both JM3D and ULIP include experiments on 3D Res or object detection to support their ability in sparse perception. The paper needs similar experiments.
- Contribution. In Ta

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning