Goal2Skill: Long-Horizon Manipulation with Adaptive Planning and Reflection
Zhen Liu, Xinyu Ning, Zhe Hu, Xinxin Xie, Weize Li, Zhipeng Tang, Chongyu Wang, Zejun Yang, Hanlin Wang, Yitong Liu, Zhongzhu Pu

TL;DR
Goal2Skill introduces a dual-system framework for long-horizon embodied manipulation, combining high-level planning with low-level visuomotor control to improve robustness and success rates in complex tasks.
Contribution
The paper presents a novel dual-system approach that separates semantic planning from motor execution, enabling memory-aware reasoning and adaptive recovery in long-horizon tasks.
Findings
Goal2Skill achieves a 32.4% success rate on RMBench tasks, compared with 9.8% for the best baseline.
Structured memory and closed-loop recovery significantly improve task success.
The framework effectively handles partial observability, occlusions, and multi-stage dependencies.
Abstract
Recent vision-language-action (VLA) systems have demonstrated strong capabilities in embodied manipulation. However, most existing VLA policies rely on limited observation windows and end-to-end action prediction, which makes them brittle in long-horizon, memory-dependent tasks with partial observability, occlusions, and multi-stage dependencies. Such tasks require not only precise visuomotor control, but also persistent memory, adaptive task decomposition, and explicit recovery from execution failures. To address these limitations, we propose a dual-system framework for long-horizon embodied manipulation. Our framework explicitly separates high-level semantic reasoning from low-level motor execution. A high-level planner, implemented as a VLM-based agentic module, maintains structured task memory and performs goal decomposition, outcome verification, and error-driven correction. A…
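The closed-loop behaviour described in the abstract (structured task memory, outcome verification, and error-driven correction) can be illustrated with a minimal Python sketch. This is not the authors' implementation: `TaskMemory`, `run_task`, `flaky_execute`, and `simple_verify` are hypothetical names, the sub-goal list stands in for the VLM planner's goal decomposition, and the executor is a stub rather than a real visuomotor policy.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class TaskMemory:
    """Hypothetical structured memory: finished sub-goals plus a failure log."""
    completed: List[str] = field(default_factory=list)
    failures: List[str] = field(default_factory=list)


def run_task(
    subgoals: List[str],
    execute: Callable[[str, TaskMemory], str],
    verify: Callable[[str, str], bool],
    max_retries: int = 2,
) -> Tuple[bool, TaskMemory]:
    """Execute sub-goals in order, verifying each outcome and retrying on failure."""
    memory = TaskMemory()
    for subgoal in subgoals:
        for attempt in range(max_retries + 1):
            outcome = execute(subgoal, memory)      # low-level execution (stubbed)
            if verify(subgoal, outcome):            # outcome verification
                memory.completed.append(subgoal)
                break
            # Record the failure so a later attempt could condition on it.
            memory.failures.append(f"{subgoal} (attempt {attempt + 1})")
        else:
            return False, memory                    # retries exhausted
    return True, memory


# Stub executor (assumption): "grasp cup" fails once, then succeeds.
attempts = {}


def flaky_execute(subgoal: str, memory: TaskMemory) -> str:
    attempts[subgoal] = attempts.get(subgoal, 0) + 1
    if subgoal == "grasp cup" and attempts[subgoal] == 1:
        return "failed"
    return "done"


def simple_verify(subgoal: str, outcome: str) -> bool:
    return outcome == "done"


ok, mem = run_task(["open drawer", "grasp cup", "place cup"],
                   flaky_execute, simple_verify)
print(ok, mem.completed, mem.failures)
```

In this toy run the second sub-goal fails once, the failure is logged in memory, and the retry succeeds, so the task completes. A real system would replace the stubs with a VLM-based verifier and a learned visuomotor policy, and would likely re-plan rather than simply retry.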
