Relational Scene Graphs for Object Grounding of Natural Language Commands
Julia Kuhn, Francesco Verdoja, Tsvetomila Mihaylova, Ville Kyrki

TL;DR
This paper examines whether adding explicit spatial relations to 3D scene graphs improves large language models' ability to ground the objects referenced in natural language robot commands, and finds that open-vocabulary relations offer only limited advantages over closed-vocabulary ones.
Contribution
It introduces a pipeline that combines LLMs and vision-language models (VLMs) to add open-vocabulary spatial relations to 3D scene graphs (3DSGs), with the aim of improving object grounding for natural language commands given to robots.
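To make the idea of a scene graph with explicit spatial relations concrete, here is a minimal sketch in Python. It is not the authors' implementation: the `SceneGraph` class, its method names, the node ids, and the text serialization are all hypothetical, and it only illustrates how relation edges between object nodes could be written out as text for an LLM prompt.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Toy scene graph (hypothetical): object nodes plus directed relation edges."""
    objects: dict = field(default_factory=dict)    # node id -> object label
    relations: list = field(default_factory=list)  # (subject id, relation, object id)

    def add_object(self, node_id: str, label: str) -> None:
        self.objects[node_id] = label

    def add_relation(self, subject: str, relation: str, obj: str) -> None:
        # Only allow edges between nodes that already exist in the graph.
        for node_id in (subject, obj):
            if node_id not in self.objects:
                raise KeyError(f"unknown object: {node_id}")
        self.relations.append((subject, relation, obj))

    def to_prompt(self) -> str:
        """Serialize nodes and relation edges as plain text for an LLM prompt."""
        lines = ["Objects:"]
        lines += [f"- {nid}: {label}" for nid, label in sorted(self.objects.items())]
        lines.append("Spatial relations:")
        lines += [f"- {s} {r} {o}" for s, r, o in self.relations]
        return "\n".join(lines)


# Illustrative data only: two mugs that a command such as
# "bring me the mug next to the laptop" would need to tell apart.
graph = SceneGraph()
graph.add_object("mug_1", "mug")
graph.add_object("mug_2", "mug")
graph.add_object("laptop_1", "laptop")
graph.add_relation("mug_1", "next to", "laptop_1")
prompt_text = graph.to_prompt()
```

Without the relation edge, the two mugs are indistinguishable by label alone. With it, the serialized graph carries the information needed to resolve the reference. The relation string is free text here, which mirrors the open-vocabulary setting; a closed-vocabulary variant would restrict it to a fixed set of labels.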
Findings
Explicit spatial relations improve object grounding accuracy.
Open-vocabulary spatial edges can be generated from robot images.
Open-vocabulary relations offer limited advantages over closed-vocabulary ones.
Abstract
Robots are finding wider adoption in human environments, increasing the need for natural human-robot interaction. However, understanding a natural language command requires the robot to infer the intended task, decompose it into executable actions, and ground those actions in its knowledge of the environment, including relevant objects, agents, and locations. This challenge can be addressed by combining the ability of large language models (LLMs) to understand natural language with 3D scene graphs (3DSGs), which ground the inferred actions in a semantic representation of the environment. However, many 3DSGs lack explicit spatial relations between objects, even though humans often rely on these relations to describe an environment. This paper investigates whether incorporating open- or closed-vocabulary spatial relations into 3DSGs can improve the ability of LLMs to…
Taxonomy
Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Robotics and Sensor-Based Localization
