In-N-On: Scaling Egocentric Manipulation with in-the-wild and on-task Data

Xiongyi Cai; Ri-Zhao Qiu; Geng Chen; Lai Wei; Isabella Liu; Tianshu Huang; Xuxin Cheng; Xiaolong Wang

arXiv:2511.15704 · cs.RO · November 20, 2025

TL;DR

This paper introduces a scalable recipe for collecting and using egocentric human video, divided into in-the-wild and on-task data, to train language-conditioned manipulation policies, with domain adaptation to narrow the gap between humans and humanoids.

Contribution

The paper provides a systematic recipe for collecting and using egocentric data; introduces PHSD, a dataset with over 1,000 hours of in-the-wild egocentric data and over 20 hours of on-task data; and demonstrates Human0, a language-conditioned flow matching policy that benefits from large-scale human data and domain adaptation.

Findings

01. Human0 achieves language instruction following from human data.

02. The approach enables few-shot learning for manipulation tasks.

03. Robustness improves with the inclusion of on-task data.

Abstract

Egocentric videos are a valuable and scalable data source for learning manipulation policies. However, due to significant data heterogeneity, most existing approaches use human data only for simple pre-training, which does not unlock its full potential. This paper provides a scalable recipe for collecting and using egocentric data by dividing human data into two categories, in-the-wild and on-task, along with a systematic analysis of how to use each. We first curate a dataset, PHSD, which contains over 1,000 hours of diverse in-the-wild egocentric data and over 20 hours of on-task data directly aligned with the target manipulation tasks. This enables learning a large egocentric language-conditioned flow matching policy, Human0. With domain adaptation techniques, Human0 minimizes the gap between humans and humanoids. Empirically, we show Human0 achieves several novel properties…

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Human Pose and Action Recognition