From Motion to Behavior: Hierarchical Modeling of Humanoid Generative Behavior Control
Jusheng Zhang, Jinzhou Tang, Sidi Liu, Mingyan Li, Sheng Zhang, Jian Wang, Keze Wang

TL;DR
This paper introduces a hierarchical framework called Generative Behavior Control (GBC) that models human behavior by integrating high-level intentions with motion generation, leveraging large language models and a new annotated dataset, GBC-100K.
Contribution
The paper presents a novel unified framework for human behavior modeling that combines task and motion planning guided by LLMs, along with a new dataset GBC-100K with hierarchical annotations.
Findings
GBC generates more diverse and purposeful motions.
GBC achieves planning horizons 10 times longer than those of existing methods.
The approach outperforms existing methods in motion diversity and fidelity.
Abstract
Human motion generative modeling or synthesis aims to characterize complicated human motions of daily activities in diverse real-world environments. However, current research predominantly focuses on either low-level, short-period motions or high-level action planning, without taking into account the hierarchical goal-oriented nature of human activities. In this work, we take a step forward from human motion generation to human behavior modeling, which is inspired by cognitive science. We present a unified framework, dubbed Generative Behavior Control (GBC), to model diverse human motions driven by various high-level intentions by aligning motions with hierarchical behavior plans generated by large language models (LLMs). Our insight is that human motions can be jointly controlled by task and motion planning in robotics, but guided by LLMs to achieve improved motion diversity and…
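To make the hierarchical, goal-oriented structure described in the abstract concrete, here is a minimal Python sketch of a behavior plan that decomposes one high-level intention into sub-tasks and then into motion segments. This is not the authors' implementation: the class names (`BehaviorPlan`, `SubTask`, `MotionStep`), the fields, and the example durations are all hypothetical, and the plan is written by hand here, whereas in the paper it would come from an LLM.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MotionStep:
    """Low-level motion segment (hypothetical structure, not from the paper)."""
    description: str
    duration_s: float


@dataclass
class SubTask:
    """Mid-level task grouping several motion segments (hypothetical)."""
    name: str
    steps: List[MotionStep] = field(default_factory=list)


@dataclass
class BehaviorPlan:
    """High-level intention decomposed into ordered sub-tasks (hypothetical)."""
    intention: str
    subtasks: List[SubTask] = field(default_factory=list)

    def flatten(self) -> List[MotionStep]:
        # Ordered list of motion segments a motion generator would consume.
        return [step for task in self.subtasks for step in task.steps]

    def horizon_s(self) -> float:
        # Total duration of the behavior, summed over all motion segments.
        return sum(step.duration_s for step in self.flatten())


# Hand-written example plan; the contents are illustrative only.
plan = BehaviorPlan(
    intention="make a cup of tea",
    subtasks=[
        SubTask("fetch kettle", [
            MotionStep("walk to counter", 3.0),
            MotionStep("grasp kettle", 1.5),
        ]),
        SubTask("pour water", [MotionStep("tilt kettle", 2.0)]),
    ],
)

for step in plan.flatten():
    print(f"{step.duration_s:>4.1f}s  {step.description}")
print("horizon:", plan.horizon_s(), "s")
```

Flattening the hierarchy gives the ordered segment list that a downstream motion model could consume, while the tree itself keeps the link from each segment back to the intention it serves.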
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
- The paper highlights a meaningful research gap, moving from short-term motion generation to long-horizon, goal-directed behavior control, which is conceptually valuable for embodied AI.
- The paper proposes a full pipeline combining language planning, motion generation, and physics-based execution.
- The proposed dataset GBC-100K is relatively large compared to many prior datasets and includes hierarchical semantic annotations, which could support richer planning and evaluation of long-duratio
- The proposed framework mainly combines existing components: LLM-based behavior planning, diffusion motion models, and physics-based controllers. The integration appears incremental without introducing new theoretical insights or algorithmic advances. Claiming a “first unified solution” is overstated, given recent works combining language, motion priors, and controllers.
- The dataset is largely auto-annotated using pose estimation with LLM captioning, raising concerns about noise and annotati
1. This paper introduces a hierarchical framework, PHYLOMAN, for Generative Behavior Control (GBC), combining language-driven planning, diffusion-based motion generation, and physics-based control.
2. The paper constructs a large-scale hierarchical text-to-motion dataset with three levels of structured annotations: BehaviorScript, PoseScript, and MotionScript.
1. While the proposed PHYLOMAN framework is structurally coherent, its components (an LLM-based planner, a motion diffusion model, and a physics controller) are largely based on existing paradigms.
2. Although the paper cites MotionAgent (Wu et al., 2024) as a representative language-to-motion framework, there is no direct experimental comparison or analysis.
3. In the main experimental section, PHYLOMAN is not included in the key comparison Table 2, which presents quantitative results across baseli
- GBC formalizes long-term behavior generation, addressing key gaps in motion generation research.
- PHYLOMAN integrates hierarchical planning and physics-based control, bridging high-level semantics and low-level execution.
- GBC-100K provides a valuable, hierarchically annotated benchmark for behavior generation.
- Claims of goal-orientation and semantic coherence lack rigorous task-driven evaluation.
- Comparisons are primarily with motion generation methods, not task-and-motion planning approaches.
- Automated annotations may introduce noise; dataset limitations are not fully analyzed.
- Lack of detailed ablations to isolate contributions of hierarchical planning and MP-MDM.
1. Ambitious scope: The work reframes the field from motion generation to behavior generation, highlighting the importance of goal-directedness.
2. Scalability: Dataset construction is large-scale, leveraging ∼500k videos and semi-automated annotation pipelines, which could benefit the community if released.
3. Long-horizon motion generation: The MP-MDM parallel generation strategy is technically interesting and addresses efficiency for multi-second or minute-long behaviors.
1. Dataset reliability:
   * The dataset relies on monocular SMPL estimation (TRAM) as its “gold standard,” which is problematic because SMPL often drifts in translation even when subjects are static (e.g., the provided example clip (`H--TB3aFpxY_000115_000125`) shows the person standing still while SMPL translation varies). This undermines claims of physical plausibility.
   * Using noisy pseudo-ground-truth motion as the foundation of a benchmark (evaluation target) introduces significant bias; th
Taxonomy
Topics: Embodied and Extended Cognition · Psychiatry, Mental Health, Neuroscience · Social Robot Interaction and HRI
