TL;DR
UniT introduces a unified physical language that bridges human and humanoid embodiment differences, enabling efficient policy learning and world modeling through a shared latent space grounded in visual consequences.
Contribution
The paper proposes UniT, a novel framework that unifies human and humanoid physical representations, improving data efficiency and enabling zero-shot transfer in policy learning and world modeling.
Findings
Achieves state-of-the-art data efficiency and out-of-distribution (OOD) generalization in humanoid policy learning.
Enables zero-shot task transfer from human to humanoid robots.
Creates a shared latent space that aligns human and humanoid physical representations.
Abstract
Scaling humanoid foundation models is bottlenecked by the scarcity of robotic data. While massive egocentric human data offers a scalable alternative, bridging the cross-embodiment chasm remains a fundamental challenge due to kinematic mismatches. We introduce UniT (Unified Latent Action Tokenizer via Visual Anchoring), a framework that establishes a unified physical language for human-to-humanoid transfer. Grounded in the philosophy that heterogeneous kinematics share universal visual consequences, UniT employs a tri-branch cross-reconstruction mechanism: actions predict vision to anchor kinematics to physical outcomes, while vision reconstructs actions to filter out irrelevant visual confounders. Concurrently, a fusion branch synergizes these purified modalities into a shared discrete latent space of embodiment-agnostic physical intents. We validate UniT across two paradigms: 1) Policy…
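To make the idea of a "shared discrete latent space" more concrete, here is a minimal sketch of two embodiments with different action sizes being mapped onto tokens from one common codebook by nearest-neighbour quantization. It is not the UniT implementation and does not model the tri-branch cross-reconstruction or the visual anchoring described above. The encoders, codebook, dimensions, and function names are all hypothetical and chosen only for illustration.

```python
# Illustrative only: one codebook quantizing features from two embodiments.
# All names, dimensions, and weights are hypothetical, not taken from UniT.

def linear(weights, x):
    """Apply a weight matrix (a list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def quantize(z, codebook):
    """Return the index of the codebook entry nearest to z (squared L2)."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(z, codebook[i]))


# A single codebook shared by both embodiments (2-D latent, 3 tokens).
CODEBOOK = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Embodiment-specific encoders project different action sizes
# into the same 2-D latent space (assumed fixed linear maps here).
HUMAN_ENCODER = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # 3-D action -> 2-D latent
HUMANOID_ENCODER = [[0.5, 0.0], [0.0, 0.5]]         # 2-D action -> 2-D latent


def tokenize(action, encoder):
    """Encode an action and snap it to the nearest shared token index."""
    return quantize(linear(encoder, action), CODEBOOK)


human_token = tokenize([0.9, 0.1, 0.3], HUMAN_ENCODER)
humanoid_token = tokenize([2.0, 0.2], HUMANOID_ENCODER)
print(human_token, humanoid_token)
```

In this toy setup, two kinematically different actions land on the same token index, which is the property a shared, embodiment-agnostic vocabulary is meant to provide. In a learned system the encoders and codebook would be trained rather than fixed by hand.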
