TL;DR
RoboAgent introduces a capability-driven planning framework that decomposes complex embodied tasks into manageable vision-language problems, enabling improved multi-turn reasoning and control without external tools.
Contribution
It presents a unified VLM-based planning pipeline with multi-stage training, enhancing embodied task planning through internal capability invocation and synthetic data augmentation.
Findings
Outperforms existing methods on standard benchmarks.
Enables transparent and controllable multi-turn reasoning.
Uses a single VLM for all planning capabilities.
Abstract
This paper focuses on embodied task planning, where an agent acquires visual observations from the environment and executes atomic actions to accomplish a given task. Although recent Vision-Language Models (VLMs) have achieved impressive results in multimodal understanding and reasoning, their performance remains limited when applied to embodied planning that involves multi-turn interaction, long-horizon reasoning, and extended context analysis. To bridge this gap, we propose RoboAgent, a capability-driven planning pipeline in which the model actively invokes different sub-capabilities. Each capability maintains its own context and either produces intermediate reasoning results or interacts with the environment, according to the query given by a scheduler. This framework decomposes complex planning into a sequence of basic vision-language problems that VLMs can better address, enabling a more…
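The scheduler-and-capabilities structure described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the class names `Capability` and `Scheduler`, the capability names `perceive`, `plan`, and `act`, and the `echo_model` stand-in for the shared VLM are all hypothetical, chosen only to show one model serving several capabilities that each keep a separate context.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for the single shared VLM: maps a prompt to text.
Model = Callable[[str], str]


@dataclass
class Capability:
    """One sub-capability with its own private context (assumed design)."""
    name: str
    model: Model
    context: List[str] = field(default_factory=list)

    def invoke(self, query: str) -> str:
        # Record the query, then build the prompt from this capability's
        # own history only; other capabilities' contexts are never seen.
        self.context.append(query)
        prompt = f"[{self.name}] " + " | ".join(self.context)
        result = self.model(prompt)
        self.context.append(result)
        return result


@dataclass
class Scheduler:
    """Routes queries to capabilities and logs each call for inspection."""
    capabilities: Dict[str, Capability]
    trace: List[Tuple[str, str, str]] = field(default_factory=list)

    def dispatch(self, name: str, query: str) -> str:
        result = self.capabilities[name].invoke(query)
        self.trace.append((name, query, result))
        return result


def echo_model(prompt: str) -> str:
    # Placeholder model so the sketch runs without any VLM.
    return f"answer to: {prompt}"


def build_demo() -> Scheduler:
    # Capability names are illustrative, not taken from the paper.
    names = ["perceive", "plan", "act"]
    return Scheduler({n: Capability(n, echo_model) for n in names})


if __name__ == "__main__":
    scheduler = build_demo()
    scheduler.dispatch("perceive", "what objects are visible?")
    scheduler.dispatch("plan", "next step?")
    for step in scheduler.trace:
        print(step)
```

Keeping the call log in `trace` is one simple way to make the multi-turn process inspectable, in the spirit of the transparency the summary mentions; how RoboAgent actually records or exposes its reasoning is not specified here.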
