LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent
Jianing Yang, Xuweiyi Chen, Shengyi Qian, Nikhil Madaan, Madhavan Iyengar, David F. Fouhey, Joyce Chai

TL;DR
LLM-Grounder is a zero-shot 3D visual grounding method that leverages large language models to interpret complex natural language queries and identify objects in 3D scenes without requiring labeled training data.
Contribution
It introduces a novel pipeline combining LLMs with visual grounding tools for open-vocabulary 3D grounding, enabling generalization to new scenes and queries without training.
Findings
Achieves state-of-the-art zero-shot accuracy on ScanRefer.
Effectively handles complex language queries with LLM decomposition.
Does not require labeled training data, enabling broad applicability.
Abstract
3D visual grounding is a critical skill for household robots, enabling them to navigate, manipulate objects, and answer questions based on their environment. While existing approaches often rely on extensive labeled data or exhibit limitations in handling complex language queries, we propose LLM-Grounder, a novel zero-shot, open-vocabulary, Large Language Model (LLM)-based 3D visual grounding pipeline. LLM-Grounder utilizes an LLM to decompose complex natural language queries into semantic constituents and employs a visual grounding tool, such as OpenScene or LERF, to identify objects in a 3D scene. The LLM then evaluates the spatial and commonsense relations among the proposed objects to make a final grounding decision. Our method does not require any labeled training data and can generalize to novel 3D scenes and arbitrary text queries. We evaluate LLM-Grounder on the ScanRefer…
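The three stages the abstract describes (decompose the query, propose candidate objects, then reason over spatial relations) can be illustrated with a toy sketch. Everything below is a hypothetical stand-in and not the authors' implementation: `decompose` uses string splitting where the paper uses an LLM, `ground` uses exact label matching where the paper uses a tool such as OpenScene or LERF, and `select` uses a simple nearest-distance rule where the paper has the LLM evaluate spatial and commonsense relations. The `Candidate` type and all function names are assumptions made for illustration.

```python
from dataclasses import dataclass
from math import dist
from typing import List, Optional, Tuple


@dataclass
class Candidate:
    """Hypothetical object proposal: a label, a 3D center, and a confidence."""
    label: str
    center: Tuple[float, float, float]
    score: float


def _strip_article(phrase: str) -> str:
    """Drop a leading 'the', 'a', or 'an' from a noun phrase."""
    words = phrase.strip().split()
    if words and words[0].lower() in ("the", "a", "an"):
        words = words[1:]
    return " ".join(words)


def decompose(query: str) -> dict:
    """Stand-in for the LLM step: split a query into target, relation, landmark."""
    for relation in (" next to ", " near "):
        if relation in query:
            target, landmark = query.split(relation, 1)
            return {
                "target": _strip_article(target),
                "relation": relation.strip(),
                "landmark": _strip_article(landmark),
            }
    return {"target": _strip_article(query), "relation": None, "landmark": None}


def ground(scene: List[Candidate], phrase: str) -> List[Candidate]:
    """Stand-in for the visual grounding tool: exact label match only."""
    return [c for c in scene if c.label == phrase]


def select(targets: List[Candidate], landmarks: List[Candidate]) -> Optional[Candidate]:
    """Stand-in for the LLM's spatial reasoning: pick the target nearest a landmark."""
    if not targets:
        return None
    if not landmarks:
        # No landmark to reason about, so fall back to the highest-scoring proposal.
        return max(targets, key=lambda c: c.score)
    return min(targets, key=lambda t: min(dist(t.center, l.center) for l in landmarks))


def ground_query(scene: List[Candidate], query: str) -> Optional[Candidate]:
    """Run the three toy stages in order."""
    parts = decompose(query)
    targets = ground(scene, parts["target"])
    landmarks = ground(scene, parts["landmark"]) if parts["landmark"] else []
    return select(targets, landmarks)


# Made-up scene with two chairs and one window (coordinates in metres).
scene = [
    Candidate("chair", (0.0, 0.0, 0.0), 0.9),
    Candidate("chair", (4.0, 1.0, 0.0), 0.8),
    Candidate("window", (4.5, 1.0, 1.0), 0.7),
]
result = ground_query(scene, "the chair next to the window")
```

In this toy scene the lower-scoring chair is selected because it sits closer to the window, which is the kind of ambiguity that a label-only grounding tool cannot resolve by itself.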
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
