Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation
Homanga Bharadhwaj, Roozbeh Mottaghi, Abhinav Gupta, Shubham Tulsiani

TL;DR
Track2Act leverages web videos to predict point tracks and infer manipulation plans, enabling zero-shot generalizable robot manipulation across unseen objects and scenes with minimal robot-specific data.
Contribution
The paper introduces a novel framework that predicts point tracks from web videos to generate manipulation plans, reducing reliance on large demonstration datasets.
Findings
Enables zero-shot manipulation of unseen objects and scenes.
Combines web video-based predictions with minimal robot demonstrations.
Achieves diverse real-world manipulation tasks with minimal in-domain data.
Abstract
We seek to learn a generalizable goal-conditioned policy that enables zero-shot robot manipulation: interacting with unseen objects in novel scenes without test-time adaptation. While typical approaches rely on a large amount of demonstration data for such generalization, we propose an approach that leverages web videos to predict plausible interaction plans and learns a task-agnostic transformation to obtain robot actions in the real world. Our framework, Track2Act, predicts tracks of how points in an image should move in future time-steps based on a goal, and can be trained with diverse videos on the web, including those of humans and robots manipulating everyday objects. We use these 2D track predictions to infer a sequence of rigid transforms of the object to be manipulated, and obtain robot end-effector poses that can be executed in an open-loop manner. We then refine this open-loop…
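The step from predicted point tracks to a sequence of rigid transforms can be illustrated with a small sketch. The abstract excerpt does not say how the transforms are computed, so this example swaps in a simplified stand-in: a least-squares 2D rigid fit (the Kabsch/Procrustes method) between the points in the first frame and the same points at each later time-step, entirely in the image plane. The function names `fit_rigid_2d` and `tracks_to_transforms` and the `(T, N, 2)` track layout are assumptions made for illustration, not the authors' code.

```python
import numpy as np


def fit_rigid_2d(src, dst):
    """Least-squares rotation R and translation t such that dst ~= src @ R.T + t.

    src, dst: arrays of shape (N, 2) holding corresponding 2D points.
    Illustrative Kabsch fit; not the method described in the paper.
    """
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # 2x2 cross-covariance of the centered point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


def tracks_to_transforms(tracks):
    """Fit one rigid transform per time-step, each relative to frame 0.

    tracks: array of shape (T, N, 2), the positions of N points over T steps
    (hypothetical layout chosen for this sketch).
    Returns a list of T (R, t) pairs.
    """
    first = tracks[0]
    return [fit_rigid_2d(first, frame) for frame in tracks]


if __name__ == "__main__":
    # Synthetic example: rotate and shift a random point set over three steps.
    rng = np.random.default_rng(0)
    points = rng.normal(size=(16, 2))
    frames = []
    for angle, shift in [(0.0, 0.0), (0.2, 1.0), (0.4, 2.0)]:
        c, s = np.cos(angle), np.sin(angle)
        rot = np.array([[c, -s], [s, c]])
        frames.append(points @ rot.T + shift)
    for R, t in tracks_to_transforms(np.stack(frames)):
        print(np.round(R, 3), np.round(t, 3))
```

A full pipeline would need 3D information to recover object motion in the world and to map it to end-effector poses; the 2D fit above only shows how a set of moving points constrains a single rigid motion per time-step.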
Taxonomy
Topics: Robot Manipulation and Learning · Adversarial Robustness in Machine Learning · Human Pose and Action Recognition
