HRP: Human Affordances for Robotic Pre-Training
Mohan Kumar Srirama, Sudeep Dasari, Shikhar Bahl, Abhinav Gupta

TL;DR
This paper introduces a pre-training method for robotic vision that extracts affordances from human video data, improving robot task performance and generalization across diverse settings.
Contribution
The paper presents a framework for extracting affordances from human videos to pre-train robotic perception models, enhancing their ability to generalize and perform across various tasks and robot morphologies.
Findings
Boosts robot task performance by at least 15%
Improves generalization in out-of-distribution scenarios
Enhances performance across multiple camera views
Abstract
In order to *generalize* to various tasks in the wild, robotic agents will need a suitable representation (i.e., vision network) that enables the robot to predict optimal actions given high dimensional vision inputs. However, learning such a representation requires an extreme amount of diverse training data, which is prohibitively expensive to collect on a real robot. How can we overcome this problem? Instead of collecting more robot data, this paper proposes using internet-scale, human videos to extract "affordances," both at the environment and agent level, and distill them into a pre-trained representation. We present a simple framework for pre-training representations on hand, object, and contact "affordance labels" that highlight relevant objects in images and how to interact with them. These affordances are automatically extracted from human video data (with the help of…
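The excerpt above names three kinds of affordance labels (hand, object, contact) but does not state the training objective or the label format. As a rough illustration only, the sketch below assumes each label is a flat vector of numbers (for example, flattened 2D coordinates) and that a representation is trained by summing a per-label regression loss. The names `AFFORDANCE_HEADS`, `mse`, and `affordance_loss`, the choice of mean squared error, and the per-head weights are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Optional

# Hypothetical head names, borrowed from the abstract's wording.
AFFORDANCE_HEADS = ("hand", "object", "contact")


def mse(pred: List[float], target: List[float]) -> float:
    """Mean squared error between two equal-length, non-empty vectors."""
    if not pred or len(pred) != len(target):
        raise ValueError("pred and target must be non-empty and equal in length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def affordance_loss(
    preds: Dict[str, List[float]],
    targets: Dict[str, List[float]],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted sum of per-head regression losses (an assumed objective)."""
    if weights is None:
        # Assumption: all three label types contribute equally by default.
        weights = {name: 1.0 for name in AFFORDANCE_HEADS}
    return sum(
        weights[name] * mse(preds[name], targets[name])
        for name in AFFORDANCE_HEADS
    )


if __name__ == "__main__":
    preds = {"hand": [0.0, 0.0], "object": [1.0, 1.0], "contact": [0.5, 0.5]}
    targets = {"hand": [1.0, 1.0], "object": [1.0, 1.0], "contact": [0.5, 0.0]}
    print(affordance_loss(preds, targets))
```

In a real pipeline the predictions would come from heads attached to the vision network being pre-trained, and the loss would be minimized by gradient descent; this sketch only shows how three label types could be combined into one scalar.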
Taxonomy
Topics: Human-Automation Interaction and Safety · Robot Manipulation and Learning · Space Satellite Systems and Control
