LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent

Jianing Yang; Xuweiyi Chen; Shengyi Qian; Nikhil Madaan; Madhavan Iyengar; David F. Fouhey; Joyce Chai

arXiv:2309.12311 · cs.CV · September 22, 2023 · 2 cites

TL;DR

LLM-Grounder is a zero-shot 3D visual grounding method that leverages large language models to interpret complex natural language queries and identify objects in 3D scenes without requiring labeled training data.

Contribution

The paper introduces a pipeline that combines an LLM with existing visual grounding tools for open-vocabulary 3D grounding, so it can generalize to new scenes and queries without any training.

Findings

1. Achieves state-of-the-art zero-shot accuracy on ScanRefer.
2. Effectively handles complex language queries with LLM decomposition.
3. Does not require labeled training data, enabling broad applicability.

Abstract

3D visual grounding is a critical skill for household robots, enabling them to navigate, manipulate objects, and answer questions based on their environment. While existing approaches often rely on extensive labeled data or exhibit limitations in handling complex language queries, we propose LLM-Grounder, a novel zero-shot, open-vocabulary, Large Language Model (LLM)-based 3D visual grounding pipeline. LLM-Grounder utilizes an LLM to decompose complex natural language queries into semantic constituents and employs a visual grounding tool, such as OpenScene or LERF, to identify objects in a 3D scene. The LLM then evaluates the spatial and commonsense relations among the proposed objects to make a final grounding decision. Our method does not require any labeled training data and can generalize to novel 3D scenes and arbitrary text queries. We evaluate LLM-Grounder on the ScanRefer…
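
The three stages named in the abstract (decompose the query, propose candidate objects, reason over their relations) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the authors' code: the `Candidate` and `ParsedQuery` types, the `decompose`, `propose`, and `ground` functions, and the toy scene are all hypothetical. In particular, `decompose` is a hard-coded string split standing in for the LLM, `propose` is a label match standing in for a tool such as OpenScene or LERF, and the final decision uses a plain centroid-distance rule where the paper has the LLM evaluate spatial and commonsense relations.

```python
from dataclasses import dataclass
from itertools import product
from math import dist
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    """Hypothetical object proposal: label, centroid (x, y, z), grounder score."""
    label: str
    center: Tuple[float, float, float]
    score: float


@dataclass(frozen=True)
class ParsedQuery:
    """Hypothetical schema for the semantic constituents of a query."""
    target: str
    landmark: str
    relation: str  # this sketch only understands "near"


def decompose(query: str) -> ParsedQuery:
    """Stand-in for the LLM decomposition step (assumption: a real system
    would prompt an LLM; here one fixed phrase pattern is split by hand)."""
    target, _, landmark = query.partition(" near the ")
    if target.startswith("the "):
        target = target[len("the "):]
    return ParsedQuery(target=target.strip(), landmark=landmark.strip(), relation="near")


def propose(label: str, scene: List[Candidate]) -> List[Candidate]:
    """Stand-in for the visual grounding tool: exact label match on a toy scene."""
    return [c for c in scene if c.label == label]


def ground(query: str, scene: List[Candidate]) -> Optional[Candidate]:
    """Pick the target candidate that best satisfies the 'near' relation."""
    parsed = decompose(query)
    targets = propose(parsed.target, scene)
    landmarks = propose(parsed.landmark, scene)
    if not targets:
        return None
    if not landmarks:
        # No landmark found: fall back to the highest-scoring target.
        return max(targets, key=lambda c: c.score)
    # Rule-based stand-in for the LLM's spatial reasoning:
    # choose the target whose centroid is closest to any landmark centroid.
    best_target, _ = min(
        product(targets, landmarks),
        key=lambda pair: dist(pair[0].center, pair[1].center),
    )
    return best_target


# Toy scene: two chairs and one table (made-up coordinates).
scene = [
    Candidate("chair", (0.0, 0.0, 0.0), 0.9),
    Candidate("chair", (4.0, 0.0, 0.0), 0.8),
    Candidate("table", (4.5, 0.0, 0.0), 0.95),
]
result = ground("the chair near the table", scene)
```

In this toy scene the second chair is selected even though its grounder score is lower, because the relation to the table decides between same-label candidates. That is the role the abstract assigns to the LLM's final evaluation step.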

Code & Models

Repositories

sled-group/chat-with-nerf (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition