3PoinTr: 3D Point Tracks for Robot Manipulation Pretraining from Casual Videos
Adam Hung, Bardienus Pieter Duisterhof, Jeffrey Ichnowski

TL;DR
3PoinTr introduces a transformer-based method for pretraining robot manipulation policies from casual human videos by predicting 3D point tracks, enabling robust generalization with minimal demonstrations.
Contribution
The paper presents a novel approach using 3D point track prediction with a transformer architecture for embodiment-agnostic robot policy pretraining from unconstrained human videos.
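To make the representation concrete: a 3D point track follows a single scene point through time. The sketch below is purely illustrative and is not the paper's implementation. The class name `PointTrack3D` and its methods are hypothetical, and the transformer that predicts the tracks is not shown. It only illustrates one reading of why such tracks can be embodiment-agnostic: they record how points in the scene move, with no reference to a hand or gripper.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point3D = Tuple[float, float, float]


@dataclass
class PointTrack3D:
    """Hypothetical container: one scene point followed over T timesteps."""

    positions: List[Point3D]  # positions[t] = (x, y, z) at timestep t

    def displacement(self) -> Point3D:
        """Net motion from the first timestep to the last."""
        (x0, y0, z0), (x1, y1, z1) = self.positions[0], self.positions[-1]
        return (x1 - x0, y1 - y0, z1 - z0)

    def path_length(self) -> float:
        """Sum of Euclidean distances between consecutive timesteps."""
        total = 0.0
        for a, b in zip(self.positions, self.positions[1:]):
            total += sum((q - p) ** 2 for p, q in zip(a, b)) ** 0.5
        return total


# Toy data (invented for illustration): a point on an object that is lifted
# along z, and a background point that stays still.
moving = PointTrack3D([(0.0, 0.0, 0.0), (0.0, 0.0, 0.25), (0.0, 0.0, 0.5)])
static = PointTrack3D([(1.0, 1.0, 0.0)] * 3)
```

In this toy setup the moving point's track describes the desired object motion, while the static point only contributes scene layout. Neither depends on whether a human or a robot caused the motion.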
Findings
Achieves robust spatial generalization with only 20 demonstrations.
Outperforms existing behavior cloning and pretraining methods.
Produces more accurate 3D point tracks than baseline models.
Abstract
Data-efficient training of robust robot policies is the key to unlocking automation in a wide array of novel tasks. Current systems require large volumes of demonstrations to achieve robustness, which is impractical in many applications. Learning policies directly from human videos is a promising alternative that removes teleoperation costs, but it shifts the challenge toward overcoming the embodiment gap (differences in kinematics and strategies between robots and humans), often requiring restrictive and carefully choreographed human motions. We propose 3PoinTr, a method for pretraining robot policies from casual and unconstrained human videos, enabling learning from motions natural for humans. 3PoinTr uses a transformer architecture to predict 3D point tracks as an intermediate embodiment-agnostic representation. 3D point tracks encode goal specifications, scene geometry, and…
Taxonomy
Topics: Robot Manipulation and Learning · Human Pose and Action Recognition · Social Robot Interaction and HRI
