Relational Scene Graphs for Object Grounding of Natural Language Commands

Julia Kuhn; Francesco Verdoja; Tsvetomila Mihaylova; Ville Kyrki

arXiv:2602.04635·cs.RO·February 5, 2026

Open Access

TL;DR

This paper explores how incorporating explicit spatial relations into 3D scene graphs enhances large language models' ability to interpret natural language commands for robot object grounding, demonstrating the benefits and limitations of open-vocabulary relations.

Contribution

The paper introduces a pipeline that combines LLMs and vision-language models (VLMs) to add open-vocabulary spatial relations to 3D scene graphs (3DSGs), improving object grounding when robots interpret natural language commands.
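To make the idea of a scene graph with explicit spatial relations concrete, here is a minimal sketch, not the authors' implementation. It assumes a toy graph of object nodes and labelled edges, serialized to plain text that could be placed in an LLM prompt. The class `SceneGraph`, its methods, the node ids, and the relation strings are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Toy scene graph (hypothetical): object nodes plus labelled relation edges."""
    # node id -> class label
    objects: dict[str, str] = field(default_factory=dict)
    # directed edges stored as (subject id, relation text, object id)
    relations: list[tuple[str, str, str]] = field(default_factory=list)

    def add_object(self, node_id: str, label: str) -> None:
        self.objects[node_id] = label

    def add_relation(self, subject: str, relation: str, obj: str) -> None:
        # Relation text is free-form, mimicking an open-vocabulary edge label.
        if subject not in self.objects or obj not in self.objects:
            raise KeyError("both endpoints must be existing objects")
        self.relations.append((subject, relation, obj))

    def to_prompt(self) -> str:
        """Serialize nodes and edges as text an LLM could read."""
        lines = ["Objects:"]
        lines += [f"- {nid}: {label}" for nid, label in sorted(self.objects.items())]
        lines.append("Relations:")
        lines += [f"- {s} {r} {o}" for s, r, o in self.relations]
        return "\n".join(lines)


# Illustrative scene: two mugs that only the spatial edges tell apart.
graph = SceneGraph()
graph.add_object("mug_1", "mug")
graph.add_object("mug_2", "mug")
graph.add_object("table_1", "table")
graph.add_object("shelf_1", "shelf")
graph.add_relation("mug_1", "on top of", "table_1")
graph.add_relation("mug_2", "inside", "shelf_1")
prompt_text = graph.to_prompt()
```

With the relation lines present, a command such as "bring me the mug on the table" can be matched to `mug_1` from the text alone; without them, the two mug nodes are indistinguishable by label.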

Findings

01. Explicit spatial relations improve object grounding accuracy.

02. Open-vocabulary spatial edges can be generated from robot images.

03. Open-vocabulary relations offer limited advantages over closed-vocabulary ones.

Abstract

Robots are finding wider adoption in human environments, increasing the need for natural human-robot interaction. However, understanding a natural language command requires the robot to infer the intended task and how to decompose it into executable actions, and to ground those actions in the robot's knowledge of the environment, including relevant objects, agents, and locations. This challenge can be addressed by combining the capabilities of large language models (LLMs) to understand natural language with 3D scene graphs (3DSGs) for grounding inferred actions in a semantic representation of the environment. However, many 3DSGs lack explicit spatial relations between objects, even though humans often rely on these relations to describe an environment. This paper investigates whether incorporating open- or closed-vocabulary spatial relations into 3DSGs can improve the ability of LLMs to…

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Robotics and Sensor-Based Localization