RoboAgent: Chaining Basic Capabilities for Embodied Task Planning

Peiran Xu; Jiaqi Zheng; Yadong Mu

arXiv:2604.07774·cs.RO·April 10, 2026

TL;DR

RoboAgent introduces a capability-driven planning framework that decomposes complex embodied tasks into manageable vision-language problems, enabling improved multi-turn reasoning and control without external tools.

Contribution

The paper presents a unified VLM-based planning pipeline trained in multiple stages. It improves embodied task planning through internal capability invocation and synthetic data augmentation.

Findings

1. Outperforms existing methods on standard benchmarks.
2. Enables transparent and controllable multi-turn reasoning.
3. Uses a single VLM for all planning capabilities.

Abstract

This paper focuses on embodied task planning, where an agent acquires visual observations from the environment and executes atomic actions to accomplish a given task. Although recent Vision-Language Models (VLMs) have achieved impressive results in multimodal understanding and reasoning, their performance remains limited when applied to embodied planning that involves multi-turn interaction, long-horizon reasoning, and extended context analysis. To bridge this gap, we propose RoboAgent, a capability-driven planning pipeline in which the model actively invokes different sub-capabilities. Each capability maintains its own context, and produces intermediate reasoning results or interacts with the environment according to the query given by a scheduler. This framework decomposes complex planning into a sequence of basic vision-language problems that VLMs can better address, enabling a more…
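The control flow described in the abstract, in which a scheduler sends queries to capabilities that each keep their own context, can be illustrated with a minimal sketch. This is not the authors' implementation: the `Capability` class, the `run_schedule` function, the capability names `perceive` and `act`, and the `echo_model` stub are all hypothetical. A fixed list of (capability, query) pairs stands in for the scheduler's decisions, and a plain function stands in for the VLM.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for a VLM call: (context, query) -> text result.
ModelFn = Callable[[List[str], str], str]


@dataclass
class Capability:
    """One sub-capability with its own private context (assumed design)."""
    name: str
    model: ModelFn
    context: List[str] = field(default_factory=list)

    def invoke(self, query: str) -> str:
        # The model sees only this capability's history, not the others'.
        result = self.model(self.context, query)
        self.context.append(f"Q: {query}")
        self.context.append(f"A: {result}")
        return result


def run_schedule(
    capabilities: Dict[str, Capability],
    plan: List[Tuple[str, str]],
) -> List[str]:
    """Dispatch each (capability name, query) pair in order.

    The fixed `plan` is an assumption standing in for a learned scheduler.
    """
    return [capabilities[name].invoke(query) for name, query in plan]


def echo_model(context: List[str], query: str) -> str:
    # Stub model: reports which turn this is for the calling capability.
    return f"turn {len(context) // 2 + 1}: {query}"


caps = {
    "perceive": Capability("perceive", echo_model),
    "act": Capability("act", echo_model),
}
results = run_schedule(
    caps,
    [
        ("perceive", "locate the mug"),
        ("act", "pick up the mug"),
        ("perceive", "confirm the mug is held"),
    ],
)
for line in results:
    print(line)
```

In this sketch the second `perceive` call reports turn 2 while `act` is still on turn 1, which shows the per-capability contexts staying separate across a multi-turn episode.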

Code & Models

Repositories

woyut/RoboAgent_CVPR26 (GitHub)

Models

woyut/RoboAgent_CVPR26 (Hugging Face model, 13 downloads)
