Hierarchical Vision-Language Planning for Multi-Step Humanoid Manipulation
André Schakkal, Ben Zandonati, Zhutian Yang, Navid Azizan

TL;DR
This paper introduces a hierarchical vision-language planning system for humanoid robots to perform complex multi-step manipulation tasks, combining low-level control, skill policies, and high-level planning with real-time monitoring.
Contribution
It presents a novel hierarchical framework integrating vision-language models for planning and monitoring multi-step humanoid manipulation tasks.
Findings
Achieved a 73% success rate over 40 real-world trials
Demonstrated effective skill planning and real-time monitoring with VLMs
Validated the system on a Unitree G1 humanoid robot performing a non-prehensile pick-and-place task
Abstract
Enabling humanoid robots to reliably execute complex multi-step manipulation tasks is crucial for their effective deployment in industrial and household environments. This paper presents a hierarchical planning and control framework designed to achieve reliable multi-step humanoid manipulation. The proposed system comprises three layers: (1) a low-level RL-based controller responsible for tracking whole-body motion targets; (2) a mid-level set of skill policies trained via imitation learning that produce motion targets for different steps of a task; and (3) a high-level vision-language planning module that determines which skills should be executed and also monitors their completion in real-time using pretrained vision-language models (VLMs). Experimental validation is performed on a Unitree G1 humanoid robot executing a non-prehensile pick-and-place task. Over 40 real-world trials, the…
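To make the three-layer structure described in the abstract concrete, the following is a minimal sketch of how such a control loop might be wired together. It is not the authors' implementation: the class name `HierarchicalRunner`, all callables (`planner`, `monitor`, `skills`, `controller`, `camera`), the string-valued "image", and the step budget `max_steps_per_skill` are hypothetical placeholders. In a real system the planner and monitor would be queries to a pretrained VLM, each skill would be an imitation-learned policy, and the controller would be the RL-based whole-body tracker.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class HierarchicalRunner:
    """Hypothetical three-layer loop: planner/monitor -> skill policy -> controller."""

    # High level (assumed VLM-backed): instruction + image -> ordered skill names.
    planner: Callable[[str, str], List[str]]
    # High level (assumed VLM-backed): skill name + image -> is the skill complete?
    monitor: Callable[[str, str], bool]
    # Mid level: skill name -> policy mapping an image to a motion target.
    skills: Dict[str, Callable[[str], List[float]]]
    # Low level: tracks a single whole-body motion target.
    controller: Callable[[List[float]], None]
    # Sensor stub: returns the current observation (a string stands in for an image).
    camera: Callable[[], str]
    # Assumed safety budget so a stuck skill cannot loop forever.
    max_steps_per_skill: int = 50
    # Names of skills the monitor has confirmed as complete, in order.
    log: List[str] = field(default_factory=list)

    def run(self, instruction: str) -> bool:
        """Execute the planned skills in order; return False if any skill times out."""
        plan = self.planner(instruction, self.camera())
        for skill in plan:
            policy = self.skills[skill]
            for _ in range(self.max_steps_per_skill):
                obs = self.camera()
                if self.monitor(skill, obs):
                    self.log.append(skill)
                    break
                # Skill policy proposes a motion target; the controller tracks it.
                self.controller(policy(obs))
            else:
                # Step budget exhausted without the monitor confirming completion.
                return False
        return True
```

In this sketch the monitor is polled before every control step, so a skill that is already complete is skipped without issuing any motion targets. How often the real system queries its VLM monitor is not stated in the text above, so the per-step polling here is only an illustrative choice.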
