TL;DR
SUGAR is a scalable framework that converts human videos into generalizable humanoid loco-manipulation skills without task-specific reward engineering or reference motion conditioning, enabling zero-shot real-world transfer.
Contribution
SUGAR introduces a fully automated pipeline and a physics-based refiner that together transform human videos into high-fidelity humanoid skills for diverse tasks.
Findings
Outperforms reference-tracking baselines on both simulated and real-world tasks.
Performance improves with more human video data.
Achieves zero-shot transfer with reliable execution and failure recovery.
Abstract
Building humanoid robots capable of generalizable whole-body loco-manipulation in the real world remains a fundamental challenge. Existing methods either rely on laborious task-specific reward engineering, rigidly replay reference motions that fail to generalize, or depend on costly teleoperation that limits scalability. While human videos capture diverse human behaviors, motion priors inferred from them are inherently imperfect, suffering from occlusion, contact artifacts, and retargeting errors that render them unsuitable for direct policy learning. To address this, we present SUGAR, a scalable data-driven framework that converts diverse human videos into deployable humanoid loco-manipulation skills, without any task-specific reward engineering or reference-motion conditioning at inference. SUGAR proceeds in three stages. First, a fully automated pipeline extracts kinematic…
