Hierarchical World Models as Visual Whole-Body Humanoid Controllers
Nicklas Hansen, Jyothir S V, Vlad Sobal, Yann LeCun, Xiaolong Wang, Hao Su

TL;DR
This paper introduces a hierarchical reinforcement learning framework for visual whole-body control of humanoid robots, achieving high performance and human-like motions in complex tasks without simplifying assumptions.
Contribution
It presents a novel hierarchical world model that enables data-driven visual control of high-DoF humanoids without reward engineering or skill primitives.
Findings
Achieves high performance in 8 simulated humanoid tasks
Synthesizes motions broadly preferred by humans
Operates without simplifying assumptions or reward design
Abstract
Whole-body control for humanoids is challenging due to the high-dimensional nature of the problem, coupled with the inherent instability of a bipedal morphology. Learning from visual observations further exacerbates this difficulty. In this work, we explore highly data-driven approaches to visual whole-body humanoid control based on reinforcement learning, without any simplifying assumptions, reward design, or skill primitives. Specifically, we propose a hierarchical world model in which a high-level agent generates commands based on visual observations for a low-level agent to execute, both of which are trained with rewards. Our approach produces highly performant control policies in 8 tasks with a simulated 56-DoF humanoid, while synthesizing motions that are broadly preferred by humans.
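The two-level structure described in the abstract can be illustrated with a minimal sketch: a high-level agent maps visual features to a command, and a low-level agent maps proprioception plus that command to joint actions. Only the 56-dimensional action space comes from the abstract. The class names, the other dimensions, the random linear maps standing in for learned models, and the action clipping are all hypothetical placeholders, and nothing here reproduces the authors' world-model training or planning.

```python
import random

ACTION_DIM = 56    # from the abstract: simulated 56-DoF humanoid
COMMAND_DIM = 8    # hypothetical size of the high-level command
VISUAL_DIM = 32    # hypothetical size of a flattened visual feature vector
PROPRIO_DIM = 16   # hypothetical size of the proprioceptive state


def make_linear(in_dim, out_dim, rng):
    """Random weights standing in for a learned model (assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(out_dim)]


def apply_linear(weights, x):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


class HighLevelAgent:
    """Turns visual observations into a command for the low level."""

    def __init__(self, rng):
        self.weights = make_linear(VISUAL_DIM, COMMAND_DIM, rng)

    def act(self, visual):
        return apply_linear(self.weights, visual)


class LowLevelAgent:
    """Turns proprioception plus a command into joint actions."""

    def __init__(self, rng):
        self.weights = make_linear(PROPRIO_DIM + COMMAND_DIM, ACTION_DIM, rng)

    def act(self, proprio, command):
        raw = apply_linear(self.weights, proprio + command)
        # Clip to [-1, 1], a common action range (assumption).
        return [max(-1.0, min(1.0, a)) for a in raw]


def control_step(high, low, visual, proprio):
    """One hierarchical step: the high level commands, the low level executes."""
    command = high.act(visual)
    action = low.act(proprio, command)
    return command, action


rng = random.Random(0)
high, low = HighLevelAgent(rng), LowLevelAgent(rng)
visual = [rng.random() for _ in range(VISUAL_DIM)]
proprio = [rng.random() for _ in range(PROPRIO_DIM)]
command, action = control_step(high, low, visual, proprio)
print(len(command), len(action))
```

In this sketch the low-level agent never sees the visual input; it receives only the command, which is one simple way to keep the two levels decoupled.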
Peer Reviews
Decision: ICLR 2025 Poster
1. The hierarchical world model, which integrates high-level visual guidance with low-level proprioceptive control, is novel in its simplicity and efficacy, especially in achieving natural motion without predefined rewards or skill primitives. 2. Puppeteer advances visual whole-body humanoid control by setting new standards for naturalness and efficiency in motion synthesis. The zero-shot generalization to unseen tasks demonstrates the model’s potential for practical application.
1. Lack of low-level tracking performance evaluation. There is no evaluation of, or metrics for, tracking accuracy or success rate. Several works from both the simulated avatar community [1,2] and real-world humanoids [3,4] evaluate tracking performance. I am surprised that these works are not mentioned and that their metrics are not used for evaluation in this work. [1] Luo, Z., Cao, J., Kitani, K., & Xu, W. (2023). Perpetual humanoid control for real-time simulated avatars. In Proc
1. The research addresses a significant and practical challenge in generalist agents: controlling a humanoid agent from visual observations using generalizable world models. 2. The methodology involves training a low-level agent on trajectory tracking that is adaptable across a range of control tasks, showing promising generalizability. 3. A high-level agent controls the humanoid from visual observations, a task-specific but broadly applicable approach in real-world scenarios. 4. A user study va
1. The evaluation heavily relies on the "naturalness" of movements, which depends on subjective human judgments of what is considered "human-like." This criterion, while important, may not fully evaluate the feasibility of such motions in actual humanoid robots, which face different kinematic and dynamic constraints than humans. 2. Based on Figure 5, the episodic return of the baseline TD-MPC2 is comparable or superior to the proposed method across most tasks. It would be beneficial to evaluate
- the paper is generally well-written and an enjoyable read, I also liked the figures and plots. - the proposed approach is very interesting and promising and is a natural next step to extend the TD-MPC2 framework. - the method is evaluated on multiple humanoid tasks including environments with only proprioception as well as others with additional visual observations. - the ablations nicely evaluate the role of the different design choices of the method. I especially appreciate the study of the
- the method section lacks a detailed motivation for why hierarchy improves the naturalness of the motions. - the method section lacks a detailed explanation of exactly how the high-level commands are used (see question 2). - the paper introduces a hierarchical version of TD-MPC2; the baselines, however, do not include a single hierarchical RL approach. I would at least consider including a hierarchical implementation of Dreamer [1]. - [main weakness] The results of the paper are weak, at
Taxonomy
Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods
