From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes
Tianxu Wang, Zhuofan Zhang, Ziyu Zhu, Yue Fan, Jing Xiong, Pengxiang Li, Xiaojian Ma, Qing Li

TL;DR
This paper introduces Anywhere3D-Bench, a comprehensive benchmark for 3D visual grounding across multiple levels, revealing significant challenges and gaps in current models' ability to understand complex 3D scenes beyond objects.
Contribution
The paper presents a new holistic benchmark for multi-level 3D visual grounding that covers four levels (human-activity areas, unoccupied space, individual objects, and object parts) and evaluates state-of-the-art 3D visual grounding methods, LLMs, and MLLMs on it.
Findings
Space-level and part-level grounding are the most challenging tasks.
Current models achieve only around 30-40% accuracy on space-level and part-level grounding.
Significant gaps exist in models' ability to understand complex 3D scene semantics.
Abstract
3D visual grounding has made notable progress in localizing objects within complex 3D scenes. However, grounding referring expressions beyond objects in 3D scenes remains unexplored. In this paper, we introduce Anywhere3D-Bench, a holistic 3D visual grounding benchmark consisting of 2,886 referring expression-3D bounding box pairs spanning four different grounding levels: human-activity areas, unoccupied space beyond objects, individual objects in the scene, and fine-grained object parts. We assess a range of state-of-the-art 3D visual grounding methods alongside large language models (LLMs) and multimodal LLMs (MLLMs) on Anywhere3D-Bench. Experimental results reveal that space-level and part-level visual grounding pose the greatest challenges: space-level tasks require a more comprehensive spatial reasoning ability, for example, modeling distances and spatial relations within 3D space,…
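To make the evaluation setup concrete, here is a minimal sketch of how a prediction for one referring expression-3D bounding box pair could be scored and how accuracy could be aggregated per grounding level. This is an illustration under stated assumptions, not the benchmark's own code: the `Box3D` class, the short level names (`area`, `space`, `object`, `part`), the axis-aligned box representation, and the IoU threshold of 0.25 are all hypothetical choices, since the text above does not specify the metric.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Box3D:
    """Axis-aligned 3D box given by its center and size (assumed representation)."""
    center: tuple[float, float, float]
    size: tuple[float, float, float]

    def bounds(self):
        # Lower and upper corner along each axis.
        lo = tuple(c - s / 2 for c, s in zip(self.center, self.size))
        hi = tuple(c + s / 2 for c, s in zip(self.center, self.size))
        return lo, hi

    def volume(self):
        dx, dy, dz = self.size
        return dx * dy * dz


def iou_3d(a: Box3D, b: Box3D) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    a_lo, a_hi = a.bounds()
    b_lo, b_hi = b.bounds()
    inter = 1.0
    for k in range(3):
        overlap = min(a_hi[k], b_hi[k]) - max(a_lo[k], b_lo[k])
        if overlap <= 0:
            return 0.0  # boxes are disjoint along this axis
        inter *= overlap
    union = a.volume() + b.volume() - inter
    return inter / union if union > 0 else 0.0


def accuracy_by_level(samples, iou_threshold=0.25):
    """samples: iterable of (level, predicted_box, ground_truth_box).

    A prediction counts as correct when its IoU with the ground truth
    reaches the threshold (0.25 here is an assumed value).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for level, pred, gt in samples:
        totals[level] += 1
        if iou_3d(pred, gt) >= iou_threshold:
            hits[level] += 1
    return {level: hits[level] / totals[level] for level in totals}
```

Reporting accuracy separately for each level, rather than as one pooled number, is what would let a gap between space-level or part-level grounding and the other levels show up in the results.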
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Tactile and Sensory Interactions
