From Grounding to Manipulation: Case Studies of Foundation Model Integration in Embodied Robotic Systems

Xiuchao Sui; Daiying Tian; Qi Sun; Ruirui Chen; Dongkyu Choi; Kenneth Kwok; Soujanya Poria

arXiv:2505.15685 · cs.RO · November 4, 2025

TL;DR

This paper compares three foundation model (FM) integration strategies in embodied robotics, evaluates their effectiveness on instruction understanding and manipulation tasks, and discusses design implications and future challenges.

Contribution

It provides a systematic evaluation of three FM integration paradigms in robotics through case studies, highlighting trade-offs and guiding design choices for language-driven agents.

Findings

01. End-to-end VLA models excel in complex instruction understanding.

02. Modular pipelines offer better data efficiency in skill transfer.

03. Trade-offs exist between generalization and data efficiency in FM-based robotics.

Abstract

Foundation models (FMs) are increasingly used to bridge language and action in embodied agents, yet the operational characteristics of different FM integration strategies remain under-explored -- particularly for complex instruction following and versatile action generation in changing environments. This paper examines three paradigms for building robotic systems: end-to-end vision-language-action (VLA) models that implicitly integrate perception and planning, and modular pipelines incorporating either vision-language models (VLMs) or multimodal large language models (LLMs). We evaluate these paradigms through two focused case studies: a complex instruction grounding task assessing fine-grained instruction understanding and cross-modal disambiguation, and an object manipulation task targeting skill transfer via VLA finetuning. Our experiments in zero-shot and few-shot settings reveal…

Taxonomy

Topics: Modular Robots and Swarm Intelligence · Reinforcement Learning in Robotics