Masquerade: Learning from In-the-wild Human Videos using Data-Editing

Marion Lepert; Jiaying Fang; Jeannette Bohg

arXiv:2508.09976·cs.RO·August 14, 2025

Open Access

TL;DR

Masquerade edits in-the-wild egocentric human videos into robot-like demonstrations and uses them to pre-train robot policies, improving generalization to unseen scenes.

Contribution

The paper introduces a data-editing pipeline that turns human videos into robotized demonstrations. These edited clips are used to pre-train a visual encoder, after which a policy is fine-tuned on a small amount of real robot data (50 demonstrations per task).

Findings

01. Policies trained with Masquerade outperform baselines by 5-6x on unseen scenes.

02. Pre-training on 675K edited frames improves generalization.

03. Both robot overlay and co-training are essential for performance gains.

Abstract

Robot manipulation research still suffers from significant data scarcity: even the largest robot datasets are orders of magnitude smaller and less diverse than those that fueled recent breakthroughs in language and vision. We introduce Masquerade, a method that edits in-the-wild egocentric human videos to bridge the visual embodiment gap between humans and robots and then learns a robot policy with these edited videos. Our pipeline turns each human video into robotized demonstrations by (i) estimating 3-D hand poses, (ii) inpainting the human arms, and (iii) overlaying a rendered bimanual robot that tracks the recovered end-effector trajectories. Pre-training a visual encoder to predict future 2-D robot keypoints on 675K frames of these edited clips, and continuing that auxiliary loss while fine-tuning a diffusion policy head on only 50 robot demonstrations per task, yields policies…
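The auxiliary objective in the abstract, predicting future 2-D robot keypoints, needs per-frame targets derived from the recovered end-effector trajectories. The following is a minimal sketch of how such targets could be built, not the authors' implementation. It assumes a pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy`), one 3-D end-effector point per frame expressed in the camera frame, and a fixed prediction `horizon`; all of these names and choices are hypothetical and do not come from the paper summary.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Point2D = Tuple[float, float]


def project_to_pixels(point: Point3D, fx: float, fy: float,
                      cx: float, cy: float) -> Point2D:
    """Pinhole projection of a camera-frame 3-D point (z forward) to pixels."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    return (fx * x / z + cx, fy * y / z + cy)


def future_keypoint_targets(trajectory: List[Point3D], horizon: int,
                            fx: float, fy: float,
                            cx: float, cy: float) -> List[List[Point2D]]:
    """For each frame t, collect the 2-D keypoints of frames t+1 .. t+horizon.

    Assumption: frames too close to the end of the clip to have a full
    horizon are skipped rather than padded.
    """
    pixels = [project_to_pixels(p, fx, fy, cx, cy) for p in trajectory]
    return [pixels[t + 1: t + 1 + horizon]
            for t in range(len(pixels) - horizon)]


# Hypothetical 4-frame end-effector trajectory in metres (camera frame).
trajectory = [(0.0, 0.0, 1.0), (0.1, 0.0, 1.0), (0.2, 0.0, 1.0), (0.2, 0.1, 2.0)]
targets = future_keypoint_targets(trajectory, horizon=2,
                                  fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

With four frames and a horizon of two, the sketch produces targets for the first two frames only. A bimanual setup would repeat this for each arm, and any weighting between the keypoint loss and the policy loss during fine-tuning is left out here because the summary does not specify it.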

Taxonomy

Topics: Human Pose and Action Recognition · Social Robot Interaction and HRI · Robot Manipulation and Learning