HEX: Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation

Shuanghao Bai; Meng Li; Xinyuan Lv; Jiawei Wang; Xinhua Wang; Fei Liao; Chengkai Hou; Langzhe Gu; Wanqi Zhou; Kun Wu; Ziluo Ding; Zhiyuan Xu; Lei Sun; Shanghang Zhang; Zhengping Che; Jian Tang; Badong Chen

arXiv:2604.07993·cs.RO·May 20, 2026

TL;DR

HEX is a state-centric framework for coordinated whole-body manipulation on humanoid robots. It combines a humanoid-aligned universal state representation, a mixture-of-experts proprioceptive predictor, and vision-language cues, and reports state-of-the-art success rates on real-world humanoid tasks.

Contribution

The paper introduces HEX, a state-centric, scalable approach to humanoid manipulation that models whole-body coordination and fuses it with vision-language cues.

Findings

01. HEX achieves state-of-the-art success rates in real-world humanoid tasks.

02. The framework generalizes well to new tasks and fast-reaction scenarios.

03. Lightweight history tokens improve temporal context understanding without heavy computation.
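
The history-token idea in finding 03 can be illustrated with a toy sketch. This is not HEX's implementation; it only shows the general pattern of keeping a short buffer of compact tokens so that each past image is encoded once, when it arrives, instead of being re-encoded at every inference step. The names `encode_frame` and `HistoryTokenBuffer`, the average-pooling stand-in for an image encoder, and the buffer size are all assumptions made for illustration.

```python
from collections import deque
from typing import Deque, List, Sequence

Token = List[float]


def encode_frame(frame: Sequence[float], token_dim: int = 4) -> Token:
    """Hypothetical stand-in encoder: average-pool a flat frame into token_dim values."""
    chunk = max(1, len(frame) // token_dim)
    token: Token = []
    for i in range(token_dim):
        part = frame[i * chunk:(i + 1) * chunk]
        token.append(sum(part) / len(part) if part else 0.0)
    return token


class HistoryTokenBuffer:
    """Keeps the most recent compact tokens; raw past frames are never stored."""

    def __init__(self, max_tokens: int = 3) -> None:
        self.tokens: Deque[Token] = deque(maxlen=max_tokens)
        self.encoder_calls = 0  # counts how often the (expensive) encoder runs

    def step(self, frame: Sequence[float]) -> List[Token]:
        # Encode only the newest frame: one encoder call per step.
        token = encode_frame(frame)
        self.encoder_calls += 1
        # Context for the policy = cached history tokens + current token.
        context = list(self.tokens) + [token]
        # Cache the current token; the deque drops the oldest one when full.
        self.tokens.append(token)
        return context


if __name__ == "__main__":
    buffer = HistoryTokenBuffer(max_tokens=2)
    for t in range(5):
        context = buffer.step([float(t)] * 8)
        print(t, len(context), buffer.encoder_calls)
```

The point of the design is that the per-step cost stays at one encoder call regardless of how many past observations contribute to the context.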

Abstract

Humans achieve complex manipulation through coordinated whole-body control, whereas most Vision-Language-Action (VLA) models treat robot body parts largely independently, making high-DoF humanoid control challenging and often unstable. We present HEX, a state-centric framework for coordinated manipulation on full-sized bipedal humanoid robots. HEX introduces a humanoid-aligned universal state representation for scalable learning across heterogeneous embodiments, and incorporates a Mixture-of-Experts Unified Proprioceptive Predictor to model whole-body coordination and temporal motion dynamics from large-scale multi-embodiment trajectory data. To efficiently capture temporal visual context, HEX uses lightweight history tokens to summarize past observations, avoiding repeated encoding of historical images during inference. It further employs a residual-gated fusion mechanism with a…
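
As a rough illustration of the mixture-of-experts idea named in the abstract, the sketch below blends the outputs of several expert predictors using softmax gate weights computed from the current state vector. It is a generic, dense, soft-gated mixture and not the architecture from the paper: the routing scheme, the expert definitions, and the names `softmax`, `linear`, and `moe_predict` are assumptions chosen for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Sequence[float]], Vector]


def softmax(logits: Sequence[float]) -> Vector:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights: Sequence[Sequence[float]], x: Sequence[float]) -> Vector:
    """Matrix-vector product; each row of weights yields one output value."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def moe_predict(
    state: Sequence[float],
    gate_weights: Sequence[Sequence[float]],
    experts: Sequence[Expert],
) -> Tuple[Vector, Vector]:
    """Return the gate-weighted blend of expert outputs and the gate itself."""
    # One gate logit per expert, computed from the state (hypothetical linear gate).
    gate = softmax(linear(gate_weights, state))
    outputs = [expert(state) for expert in experts]
    dim = len(outputs[0])
    blended = [sum(g * out[d] for g, out in zip(gate, outputs)) for d in range(dim)]
    return blended, gate


if __name__ == "__main__":
    # Two toy experts acting on a 2-D state vector.
    experts = [
        lambda s: [2.0 * v for v in s],  # "doubling" expert
        lambda s: [0.0 for _ in s],      # "zero" expert
    ]
    prediction, gate = moe_predict([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], experts)
    print(prediction, gate)
```

Because the gate depends on the state, different regions of the state space can lean on different experts while the output remains a single prediction vector.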

Code & Models

Models

Cognition2ActionLab/HEX-model (model, Hugging Face)
