TraceGen: World Modeling in 3D Trace Space Enables Learning from Cross-Embodiment Videos
Seungjae Lee, Yoonkyo Jung, Inkook Chun, Yao-Chih Lee, Zikui Cai, Hongjia Huang, Aayush Talreja, Tan Dat Dao, Yongyuan Liang, Jia-Bin Huang, Furong Huang

TL;DR
TraceGen introduces a 3D trace-space world model that enables robots to learn new tasks from cross-embodiment videos efficiently, significantly reducing data requirements and inference time.
Contribution
The paper presents TraceGen, a novel world model that predicts motion in a symbolic 3D trace-space, facilitating cross-embodiment learning from heterogeneous videos.
Findings
Achieves 80% success with five target videos across four tasks.
Offers 50-600x faster inference than pixel-based models.
Reaches 67.5% success with uncalibrated human videos on real robots.
Abstract
Learning new robot tasks on new platforms and in new scenes from only a handful of demonstrations remains challenging. While videos of other embodiments - humans and different robots - are abundant, differences in embodiment, camera, and environment hinder their direct use. We address the small-data problem by introducing a unifying, symbolic representation - a compact 3D "trace-space" of scene-level trajectories - that enables learning from cross-embodiment, cross-environment, and cross-task videos. We present TraceGen, a world model that predicts future motion in trace-space rather than pixel space, abstracting away appearance while retaining the geometric structure needed for manipulation. To train TraceGen at scale, we develop TraceForge, a data pipeline that transforms heterogeneous human and robot videos into consistent 3D traces, yielding a corpus of 123K videos and 1.8M…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsHuman Pose and Action Recognition · Multimodal Machine Learning Applications · Social Robot Interaction and HRI
