3D CoCa v2: Contrastive Learners with Test-Time Search for Generalizable Spatial Intelligence
Hao Tang, Ting Huang, Zeyu Zhang

TL;DR
3D CoCa v2 is a 3D captioning framework that combines contrastive vision-language learning with test-time search, improving generalization across diverse 3D environments, including indoor and outdoor scenes, without updating the captioner's parameters.
Contribution
The paper presents a novel 3D captioning model that unifies contrastive vision-language learning with caption generation and employs test-time search for improved robustness and OOD generalization.
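To make the idea of test-time search with a frozen captioner concrete, here is a minimal sketch of generic best-of-N candidate selection. It is an illustration under stated assumptions, not the paper's procedure: the names `select_best_caption`, `toy_generate`, and `toy_score` are hypothetical, the canned captions stand in for samples from a captioner, and the word-overlap scorer stands in for whatever scoring signal the method actually uses.

```python
from typing import Callable, List, Tuple


def select_best_caption(
    generate: Callable[[int], str],
    score: Callable[[str], float],
    num_candidates: int = 4,
) -> Tuple[str, float]:
    """Best-of-N selection: draw several captions, keep the top-scoring one.

    The captioner behind `generate` is never updated; only its outputs
    are compared at inference time.
    """
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    candidates: List[str] = [generate(seed) for seed in range(num_candidates)]
    scored = [(score(caption), caption) for caption in candidates]
    best_score, best_caption = max(scored, key=lambda pair: pair[0])
    return best_caption, best_score


# Hypothetical stand-in for sampling from a frozen captioner.
_CANNED = [
    "a chair",
    "a brown chair next to the table",
    "a table",
    "a brown chair",
]


def toy_generate(seed: int) -> str:
    return _CANNED[seed % len(_CANNED)]


# Hypothetical stand-in for a scene-aware scorer: the fraction of
# assumed scene keywords that the caption mentions.
_SCENE_WORDS = {"brown", "chair", "table", "next"}


def toy_score(caption: str) -> float:
    words = set(caption.split())
    return len(words & _SCENE_WORDS) / len(_SCENE_WORDS)


best_caption, best_score = select_best_caption(toy_generate, toy_score, num_candidates=4)
print(best_caption, best_score)
```

In this toy setup the search only reranks candidates, so output quality depends entirely on the diversity of the samples and on how well the scorer reflects the scene.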
Findings
Improves CIDEr on ScanRefer by +1.50
Improves zero-shot OOD performance by +3.8 CIDEr
Outperforms previous models on multiple benchmarks
Abstract
Spatial intelligence refers to the ability to perceive, reason about, and describe objects and their relationships within three-dimensional environments, forming a foundation for embodied perception and scene understanding. 3D captioning aims to describe 3D scenes in natural language; however, it remains challenging due to the sparsity and irregularity of point clouds and, more critically, the weak grounding and limited out-of-distribution (OOD) generalization of existing captioners across drastically different environments, including indoor and outdoor 3D scenes. To address this challenge, we propose 3D CoCa v2, a generalizable 3D captioning framework that unifies contrastive vision-language learning with 3D caption generation and further improves robustness via test-time search (TTS) without updating the captioner parameters. 3D CoCa v2 builds on a frozen CLIP-based semantic prior, a…
Taxonomy
Topics: Multimodal Machine Learning Applications · 3D Shape Modeling and Analysis · Human Motion and Animation
