UniGround: Universal 3D Visual Grounding via Training-Free Scene Parsing

Jiaxi Zhang; Yunheng Wang; Wei Lu; Taowen Wang; Weisheng Xu; Shuning Zhang; Yixiao Feng; Yuetong Fang; Renjing Xu

arXiv:2603.08131 · cs.RO · March 10, 2026

Open Access

TL;DR

UniGround introduces a training-free, scene-agnostic approach to 3D visual grounding, enabling robust object localization in complex environments without relying on pre-trained perception models.

Contribution

The paper proposes a training-free reasoning framework for 3D visual grounding that outperforms existing methods in zero-shot settings and generalizes to real-world scenes.

Findings

01. Achieves state-of-the-art zero-shot accuracy on EmbodiedScan

02. Demonstrates robust generalization in real-world, uncontrolled environments

03. Operates without any 3D supervision or pre-trained models

Abstract

Understanding and localizing objects in complex 3D environments from natural language descriptions, known as 3D Visual Grounding (3DVG), is a foundational challenge in embodied AI, with broad implications for robotics, augmented reality, and human-machine interaction. Large-scale pre-trained foundation models have driven significant progress on this front, enabling open-vocabulary 3DVG that allows systems to locate arbitrary objects in a given scene. However, their reliance on pre-trained models constrains 3D perception and reasoning within the inherited knowledge boundaries, resulting in limited generalization to unseen spatial relationships and poor robustness to out-of-distribution scenes. In this paper, we replace this constrained perception with training-free visual and geometric reasoning, thereby unlocking open-world 3DVG that enables the localization of any object in any scene…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning