UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling

Boyu Chen; Yi Chen; Lu Qiu; Jerry Bai; Yuying Ge; and Yixiao Ge

arXiv:2604.19734 · cs.RO · April 22, 2026



TL;DR

UniT introduces a unified physical language that bridges human and humanoid embodiment differences, enabling efficient policy learning and world modeling through a shared latent space grounded in visual consequences.

Contribution

The paper proposes UniT, a novel framework that unifies human and humanoid physical representations, improving data efficiency and enabling zero-shot transfer in policy learning and world modeling.

Findings

01. Achieves state-of-the-art data efficiency and OOD generalization in humanoid policy learning.

02. Enables zero-shot task transfer from human to humanoid robots.

03. Creates a shared latent space that aligns human and humanoid physical representations.

Abstract

Scaling humanoid foundation models is bottlenecked by the scarcity of robotic data. While massive egocentric human data offers a scalable alternative, bridging the cross-embodiment chasm remains a fundamental challenge due to kinematic mismatches. We introduce UniT (Unified Latent Action Tokenizer via Visual Anchoring), a framework that establishes a unified physical language for human-to-humanoid transfer. Grounded in the philosophy that heterogeneous kinematics share universal visual consequences, UniT employs a tri-branch cross-reconstruction mechanism: actions predict vision to anchor kinematics to physical outcomes, while vision reconstructs actions to filter out irrelevant visual confounders. Concurrently, a fusion branch synergizes these purified modalities into a shared discrete latent space of embodiment-agnostic physical intents. We validate UniT across two paradigms: 1) Policy…
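
The idea of a shared discrete latent space can be illustrated with a toy nearest-codebook lookup. This is a minimal sketch under stated assumptions, not the authors' implementation: the averaging `fuse` function, the three-entry `CODEBOOK`, and all feature values are hypothetical, and the cross-reconstruction objectives described in the abstract are not modeled. It only shows how two different action encodings that lead to the same visual outcome could land on the same discrete token.

```python
import math

def nearest_code(z, codebook):
    """Return the index of the codebook vector closest to z (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda i: math.dist(z, codebook[i]))

def fuse(action_feat, vision_feat):
    """Toy fusion (an assumption): element-wise average of two equal-length vectors."""
    return [(a + v) / 2.0 for a, v in zip(action_feat, vision_feat)]

# Hypothetical 2-D codebook holding three discrete "physical intent" tokens.
CODEBOOK = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Made-up features: the two embodiments encode the motion differently,
# but both motions produce the same visual consequence.
human_action = [0.9, 0.1]
humanoid_action = [1.1, -0.1]
shared_vision = [1.0, 0.0]

human_token = nearest_code(fuse(human_action, shared_vision), CODEBOOK)
humanoid_token = nearest_code(fuse(humanoid_action, shared_vision), CODEBOOK)

# In this toy setup both embodiments map to the same token index.
print(human_token, humanoid_token)
```

In this toy setup, anchoring both embodiments to the same visual feature pulls their fused vectors toward the same codebook entry, which is the intuition behind an embodiment-agnostic token.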


Code & Models

Repositories

xpeng-robotics/UniT (GitHub)
