Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation
Yifan Xie, YuAn Wang, Guangyu Chen, Jinkun Liu, Yu Sun, Wenbo Ding

TL;DR
This paper introduces MoT-HRA, a hierarchical vision-language-action framework that learns human-intention priors from large-scale human demonstrations to improve robotic manipulation.
Contribution
The paper contributes a hierarchical model (MoT-HRA) and a large-scale dataset (HA-2.2M) for learning embodiment-agnostic human manipulation priors that transfer to robots.
Findings
MoT-HRA improves motion plausibility in manipulation tasks.
The framework makes control more robust under distribution shift.
The HA-2.2M dataset (2.2M episodes) enables large-scale learning of human manipulation priors.
Abstract
Human videos contain rich manipulation priors, but using them for robot learning remains difficult because raw observations entangle scene understanding, human motion, and embodiment-specific action. We introduce MoT-HRA, a hierarchical vision-language-action framework that learns human-intention priors from large-scale human demonstrations. We first curate HA-2.2M, a 2.2M-episode action-language dataset reconstructed from heterogeneous human videos through hand-centric filtering, spatial reconstruction, temporal segmentation, and language alignment. On top of this dataset, MoT-HRA factorizes manipulation into three coupled experts: a vision-language expert predicts an embodiment-agnostic 3D trajectory, an intention expert models MANO-style hand motion as a latent human-motion prior, and a fine expert maps the intention-aware representation to robot action chunks. A shared-attention…
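The three-expert factorization described in the abstract can be pictured as a data flow from observation to trajectory to hand-motion latent to robot action chunk. The sketch below is only an illustration under stated assumptions: every name, dimension, and the toy arithmetic are hypothetical and not taken from the paper, and the experts are chained sequentially here, whereas the abstract describes them as coupled.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical sizes, chosen only for illustration (not from the paper).
HORIZON = 4      # assumed number of future steps in one action chunk
HAND_DIM = 6     # assumed size of the latent hand-motion vector
ACTION_DIM = 7   # assumed robot action size, e.g. 6-DoF pose plus gripper


@dataclass
class Observation:
    image_feature: List[float]  # stand-in for an encoded camera image
    instruction: str            # natural-language task description


def vision_language_expert(obs: Observation) -> List[List[float]]:
    """Stage 1 (toy): predict an embodiment-agnostic 3D trajectory."""
    base = sum(obs.image_feature) / len(obs.image_feature)
    scale = len(obs.instruction.split())
    # One (x, y, z) waypoint per future step.
    return [[base + 0.01 * scale * t, base, base - 0.01 * t]
            for t in range(HORIZON)]


def intention_expert(trajectory: List[List[float]]) -> List[List[float]]:
    """Stage 2 (toy): a latent hand-motion vector per waypoint.

    Stands in for the MANO-style hand-motion prior; no MANO model is used.
    """
    return [[sum(wp) * (k + 1) / HAND_DIM for k in range(HAND_DIM)]
            for wp in trajectory]


def fine_expert(trajectory: List[List[float]],
                intention: List[List[float]]) -> List[List[float]]:
    """Stage 3 (toy): map the intention-aware features to robot actions."""
    chunk = []
    for wp, latent in zip(trajectory, intention):
        feature = wp + latent  # concatenate trajectory and intention features
        chunk.append([0.1 * (j + 1) * sum(feature) for j in range(ACTION_DIM)])
    return chunk


def predict_action_chunk(obs: Observation) -> List[List[float]]:
    """Run the three stages in sequence and return one action chunk."""
    trajectory = vision_language_expert(obs)
    intention = intention_expert(trajectory)
    return fine_expert(trajectory, intention)


if __name__ == "__main__":
    obs = Observation(image_feature=[0.1, 0.2, 0.3],
                      instruction="pick up the cup")
    chunk = predict_action_chunk(obs)
    print(len(chunk), len(chunk[0]))
```

In this toy version the only point is the interface: the robot-specific stage consumes both the trajectory and the hand-motion latent, so the first two stages never see an embodiment-specific action.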
