Hierarchical Vision-Language Planning for Multi-Step Humanoid Manipulation

André Schakkal; Ben Zandonati; Zhutian Yang; Navid Azizan

arXiv:2506.22827·cs.RO·July 11, 2025

TL;DR

This paper introduces a hierarchical vision-language planning system that lets humanoid robots perform multi-step manipulation tasks. It combines an RL-based whole-body controller, imitation-learned skill policies, and a high-level planner built on pretrained vision-language models (VLMs) that also monitors skill completion in real time.

Contribution

The paper presents a three-layer hierarchical framework that uses pretrained vision-language models both to plan which skills to execute and to monitor their completion during multi-step humanoid manipulation.

Findings

01 Achieved a 73% success rate over 40 real-world trials

02 Demonstrated effective skill planning and real-time monitoring with VLMs

03 Validated the system on a Unitree G1 humanoid robot performing a non-prehensile pick-and-place task

Abstract

Enabling humanoid robots to reliably execute complex multi-step manipulation tasks is crucial for their effective deployment in industrial and household environments. This paper presents a hierarchical planning and control framework designed to achieve reliable multi-step humanoid manipulation. The proposed system comprises three layers: (1) a low-level RL-based controller responsible for tracking whole-body motion targets; (2) a mid-level set of skill policies trained via imitation learning that produce motion targets for different steps of a task; and (3) a high-level vision-language planning module that determines which skills should be executed and also monitors their completion in real-time using pretrained vision-language models (VLMs). Experimental validation is performed on a Unitree G1 humanoid robot executing a non-prehensile pick-and-place task. Over 40 real-world trials, the…
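The three layers described in the abstract can be pictured as nested loops: the planner picks a skill sequence, each skill policy emits motion targets, and the low-level controller tracks them until the monitor reports completion. The following is a minimal Python sketch of that control flow only, not the authors' implementation. Every name in it (`ToyWorld`, `plan_skills`, `skill_done`, `skill_policy`, `low_level_track`, the skill labels, and the step counts) is a hypothetical stand-in: the VLM calls are replaced by a fixed lookup and a counter check, and the learned policies by constants.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyWorld:
    """Hypothetical stand-in for robot and scene: one progress counter per skill."""
    progress: Dict[str, int] = field(default_factory=dict)


def plan_skills(task: str) -> List[str]:
    """High-level planner stand-in. The paper queries a VLM; this is a fixed lookup.
    The skill labels are placeholders, not the paper's skill set."""
    return {"pick_and_place": ["approach", "lift", "place"]}[task]


def skill_done(skill: str, world: ToyWorld) -> bool:
    """High-level monitor stand-in. The paper uses a VLM; here a skill counts as
    complete once its (assumed) progress counter reaches 3."""
    return world.progress.get(skill, 0) >= 3


def skill_policy(skill: str, world: ToyWorld) -> int:
    """Mid-level skill policy stand-in: returns a 'motion target' (a constant step)."""
    return 1


def low_level_track(world: ToyWorld, skill: str, target: int) -> None:
    """Low-level controller stand-in: 'tracks' the target by advancing progress."""
    world.progress[skill] = world.progress.get(skill, 0) + target


def run_task(task: str, world: ToyWorld, max_steps_per_skill: int = 10) -> List[str]:
    """Run the planned skills in order; return the skills that completed."""
    completed: List[str] = []
    for skill in plan_skills(task):
        for _ in range(max_steps_per_skill):
            if skill_done(skill, world):
                break
            low_level_track(world, skill, skill_policy(skill, world))
        if not skill_done(skill, world):
            return completed  # skill timed out: stop the sequence here
        completed.append(skill)
    return completed


if __name__ == "__main__":
    print(run_task("pick_and_place", ToyWorld()))
```

In this sketch the monitor is polled before every control step, so a skill ends as soon as it is judged complete, and a skill that exhausts its step budget halts the remaining sequence. How the actual system handles timeouts or failed skills is not stated in the text above.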
