Toward Accurate Long-Horizon Robotic Manipulation: Language-to-Action with Foundation Models via Scene Graphs

Sushil Samuel Dinesh; Shinkyu Park

arXiv:2510.27558·cs.RO·November 3, 2025

Open Access

TL;DR

This paper introduces a framework that combines pre-trained foundation models with dynamically maintained scene graphs to perform long-horizon robotic manipulation without domain-specific training, and evaluates it in tabletop manipulation experiments.

Contribution

The paper integrates off-the-shelf foundation models with scene graphs for robotic manipulation, removing the need for domain-specific training data.

Findings

01. Effective perception and reasoning in manipulation tasks

02. Robust task sequencing demonstrated in experiments

03. Potential for building manipulation systems on off-the-shelf models

Abstract

This paper presents a framework that leverages pre-trained foundation models for robotic manipulation without domain-specific training. The framework integrates off-the-shelf models, combining multimodal perception from foundation models with a general-purpose reasoning model capable of robust task sequencing. Scene graphs, dynamically maintained within the framework, provide spatial awareness and enable consistent reasoning about the environment. The framework is evaluated through a series of tabletop robotic manipulation experiments, and the results highlight its potential for building robotic manipulation systems directly on top of off-the-shelf foundation models.
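
To make the role of a dynamically maintained scene graph concrete, here is a minimal Python sketch. It is an illustration only, not the authors' implementation: the `SceneGraph` class, its method names, the `"on"` predicate, and the block and table names are all hypothetical. It assumes the graph stores objects as nodes and spatial relations as edges, is updated after each manipulation step, and can be serialized to text for a reasoning model.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Hypothetical scene graph: objects are nodes, spatial relations are edges."""
    positions: dict = field(default_factory=dict)  # object name -> (x, y, z)
    relations: set = field(default_factory=set)    # (subject, predicate, object)

    def add_object(self, name, position):
        self.positions[name] = position

    def relate(self, subject, predicate, obj):
        self.relations.add((subject, predicate, obj))

    def move(self, name, new_support, position):
        """Update the graph after a pick-and-place of `name` onto `new_support`."""
        # Drop the stale "on" edge for the moved object, then record the new one.
        self.relations = {
            r for r in self.relations if not (r[0] == name and r[1] == "on")
        }
        self.positions[name] = position
        self.relate(name, "on", new_support)

    def is_clear(self, name):
        """In this sketch, an object is graspable if nothing rests on it."""
        return not any(p == "on" and o == name for _, p, o in self.relations)

    def describe(self):
        """Serialize relations as sorted text lines, e.g. for a planner prompt."""
        return "\n".join(f"{s} {p} {o}" for s, p, o in sorted(self.relations))


# Usage: a two-block stack on a table (all names are illustrative).
graph = SceneGraph()
graph.add_object("table", (0.0, 0.0, 0.0))
graph.add_object("red_block", (0.1, 0.0, 0.05))
graph.add_object("blue_block", (0.1, 0.0, 0.10))
graph.relate("red_block", "on", "table")
graph.relate("blue_block", "on", "red_block")

blocked_before = graph.is_clear("red_block")    # blue_block rests on red_block
graph.move("blue_block", "table", (0.3, 0.0, 0.05))
clear_after = graph.is_clear("red_block")       # nothing rests on red_block now
summary = graph.describe()
```

Keeping relations as explicit triples is one possible design choice: it lets a planner check preconditions (such as whether an object is clear) before each step and re-read the updated state afterward, which is the kind of consistent reasoning about the environment the abstract describes.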

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Social Robot Interaction and HRI