3D CoCa v2: Contrastive Learners with Test-Time Search for Generalizable Spatial Intelligence

Hao Tang; Ting Huang; Zeyu Zhang

arXiv:2601.06496 · cs.CV · January 13, 2026

Open Access

TL;DR

3D CoCa v2 introduces a robust 3D captioning framework that combines contrastive learning with test-time search, significantly enhancing generalization and performance across diverse 3D environments without retraining.

Contribution

The paper presents a novel 3D captioning model that unifies contrastive vision-language learning with caption generation and employs test-time search for improved robustness and OOD generalization.
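As a rough illustration of what unifying contrastive learning with caption generation can look like, the sketch below adds a symmetric contrastive (InfoNCE-style) loss over scene-text similarities to a token-level captioning cross-entropy. This follows the general CoCa-style recipe and is not taken from the paper: the function names, the loss weights `w_con` and `w_cap`, and the temperature value are all assumptions made for the example.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(sim, temperature=0.07):
    """Symmetric InfoNCE over an N x N scene-text similarity matrix.

    Matched (scene, caption) pairs are assumed to lie on the diagonal.
    The temperature is an illustrative default, not a value from the paper.
    """
    n = len(sim)
    scaled = [[s / temperature for s in row] for row in sim]
    transposed = [list(col) for col in zip(*scaled)]
    scene_to_text = -sum(log_softmax(scaled[i])[i] for i in range(n)) / n
    text_to_scene = -sum(log_softmax(transposed[i])[i] for i in range(n)) / n
    return 0.5 * (scene_to_text + text_to_scene)

def captioning_loss(token_logits, target_ids):
    """Mean cross-entropy over the tokens of one caption."""
    total = sum(log_softmax(logits)[t] for logits, t in zip(token_logits, target_ids))
    return -total / len(target_ids)

def joint_loss(sim, token_logits, target_ids, w_con=1.0, w_cap=1.0):
    """Weighted sum of both objectives; the weights are hypothetical."""
    return w_con * contrastive_loss(sim) + w_cap * captioning_loss(token_logits, target_ids)
```

In a setup like this, the contrastive term pulls matched scene and caption embeddings together while the captioning term trains the decoder, so both objectives shape the same representation.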

Findings

01. Gains +1.50 CIDEr on ScanRefer

02. Improves zero-shot OOD performance by +3.8 CIDEr

03. Outperforms previous models on multiple benchmarks

Abstract

Spatial intelligence refers to the ability to perceive, reason about, and describe objects and their relationships within three-dimensional environments, forming a foundation for embodied perception and scene understanding. 3D captioning aims to describe 3D scenes in natural language; however, it remains challenging due to the sparsity and irregularity of point clouds and, more critically, the weak grounding and limited out-of-distribution (OOD) generalization of existing captioners across drastically different environments, including indoor and outdoor 3D scenes. To address this challenge, we propose 3D CoCa v2, a generalizable 3D captioning framework that unifies contrastive vision-language learning with 3D caption generation and further improves robustness via test-time search (TTS) without updating the captioner parameters. 3D CoCa v2 builds on a frozen CLIP-based semantic prior, a…
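The abstract states that test-time search improves robustness without updating the captioner parameters, but this listing does not describe the search procedure itself. One common way to realize such a scheme is best-of-N reranking: a frozen captioner proposes several candidate captions and an external scorer picks the best one. The sketch below illustrates that idea only; `best_of_n_search`, `toy_propose`, and `toy_score` are hypothetical stand-ins, not the paper's components.

```python
from typing import Callable, Sequence, Tuple

def best_of_n_search(
    propose: Callable[[int], Sequence[str]],
    score: Callable[[str], float],
    n: int = 4,
) -> Tuple[str, float]:
    """Pick the highest-scoring caption among n candidates.

    `propose` stands in for sampling from a frozen captioner and `score`
    for an external reward; no model parameters are updated here.
    """
    candidates = list(propose(n))
    if not candidates:
        raise ValueError("propose() returned no candidates")
    scored = [(score(c), c) for c in candidates]
    best_score, best_caption = max(scored, key=lambda pair: pair[0])
    return best_caption, best_score

# Toy stand-ins (assumptions for illustration only).
def toy_propose(n: int) -> Sequence[str]:
    pool = [
        "a chair",
        "a chair next to a table",
        "a sofa near the window",
        "a table",
    ]
    return pool[:n]

def toy_score(caption: str, scene_labels=frozenset({"chair", "table"})) -> float:
    """Fraction of assumed scene labels that the caption mentions."""
    words = set(caption.split())
    return len(words & scene_labels) / len(scene_labels)

caption, reward = best_of_n_search(toy_propose, toy_score, n=4)
```

Because selection happens purely at inference time, a scheme like this can be swapped in or out without retraining; its cost grows with the number of candidates scored.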

Taxonomy

Topics: Multimodal Machine Learning Applications · 3D Shape Modeling and Analysis · Human Motion and Animation