Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models
Zixing Lei, Changxing Liu, Yichen Xiong, Minhao Xiong, Yuanzhuo Ding, Zhipeng Zhang, Weixin Li, Siheng Chen

TL;DR
This paper introduces VLAs-as-Tools, a modular approach that pairs a high-level vision language model (VLM) agent with specialized vision-language-action (VLA) tools for long-horizon embodied tasks, improving both planning and execution success.
Contribution
It proposes a framework that distributes planning and execution across a VLM agent and specialized VLA tools, together with a new tool-family interface and a training method for tool alignment.
Findings
Improves success rate by 4.8 points on LIBERO-Long.
Improves success rate by 23.1 points on RoboTwin.
Increases invocation fidelity by 15 points.
Abstract
Vision-language-action (VLA) models are effective robot action executors, but they remain limited on long-horizon tasks due to the dual burden of extended closed-loop planning and diverse physical operations. We therefore propose VLAs-as-Tools, a strategy that distributes this burden across a high-level vision language model (VLM) agent for temporal reasoning and a family of specialized VLA tools for diverse local physical operations. The VLM handles scene analysis, global planning, and recovery, while each VLA tool executes a bounded subtask. To tightly couple agent planning with VLA tool execution in long-horizon tasks, we introduce a VLA tool-family interface that exposes explicit tool selection and in-execution progress feedback, enabling efficient event-triggered agent replanning without continuous agent polling. To obtain diverse specialized VLA tools that faithfully follow agent…
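The interface idea in the abstract (explicit tool selection plus in-execution progress feedback that triggers replanning only on events) can be illustrated with a minimal Python sketch. Everything here is a hypothetical stand-in and not the paper's actual API: the names `Progress`, `make_stub_tool`, `run_plan`, and the `replan` callback are assumptions for illustration, the tools are stubs rather than real VLA policies, and failure is the only event modeled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List, Optional, Tuple


@dataclass
class Progress:
    """Hypothetical progress event emitted by a tool while it executes."""
    step: int
    fraction: float        # estimated subtask completion in [0, 1]
    failed: bool = False   # set when the tool detects it cannot finish


# Assumption: a "tool" is any callable mapping an instruction to a stream
# of progress events. A real VLA tool would also consume observations.
Tool = Callable[[str], Iterator[Progress]]
Subtask = Tuple[str, str]  # (tool name, instruction)


def make_stub_tool(n_steps: int, fail_at: Optional[int] = None) -> Tool:
    """Build a stub tool that runs n_steps and optionally fails at one step."""
    def tool(instruction: str) -> Iterator[Progress]:
        for step in range(1, n_steps + 1):
            failed = fail_at is not None and step == fail_at
            yield Progress(step, step / n_steps, failed)
            if failed:
                return
    return tool


def run_plan(
    plan: List[Subtask],
    tools: Dict[str, Tool],
    replan: Callable[[str, str, List[Subtask]], List[Subtask]],
    max_replans: int = 3,
) -> List[str]:
    """Run subtasks in order; call the planner again only on a failure event."""
    log: List[str] = []
    queue = list(plan)
    replans = 0
    while queue:
        name, instruction = queue.pop(0)       # explicit tool selection
        for event in tools[name](instruction):
            if event.failed:                   # event-triggered replanning
                log.append(f"fail:{name}")
                if replans >= max_replans:
                    return log
                replans += 1
                queue = replan(name, instruction, queue)
                break
        else:
            # The tool's event stream ended without a failure event.
            log.append(f"done:{name}")
    return log


tools = {
    "pick": make_stub_tool(3, fail_at=2),
    "pick_retry": make_stub_tool(3),
    "place": make_stub_tool(2),
}


def replan(name: str, instruction: str, rest: List[Subtask]) -> List[Subtask]:
    """Toy recovery policy: retry the failed instruction with another tool."""
    return [("pick_retry", instruction)] + rest


log = run_plan(
    [("pick", "grasp the mug"), ("place", "put it on the shelf")],
    tools,
    replan,
)
print(log)  # ['fail:pick', 'done:pick_retry', 'done:place']
```

In this sketch the planner callback runs once, at the failure event, rather than at every control step, which is the sense in which replanning is event-triggered instead of polled.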
