DexVIP: Learning Dexterous Grasping with Human Hand Pose Priors from Video
Priyanka Mandikal, Kristen Grauman

TL;DR
DexVIP learns dexterous robotic grasping from in-the-wild videos by imposing human hand pose priors on the agent, which avoids lab-collected demonstrations and makes it easier to scale to new objects.
Contribution
We introduce DexVIP, a novel method that uses human hand pose priors from YouTube videos to train dexterous grasping policies via deep reinforcement learning.
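One way to picture a hand pose prior inside a reinforcement learning loop is as a shaping term added to the grasp reward. The sketch below is an illustration only, not the authors' reward: the function names, the mean-squared joint-angle penalty, and the `prior_weight` parameter are all assumptions introduced here, and the prior pose is assumed to already be expressed in the robot hand's joint space.

```python
from typing import Sequence


def pose_prior_penalty(agent_joints: Sequence[float],
                       prior_joints: Sequence[float]) -> float:
    """Mean squared difference between the agent's joint angles and a
    target pose derived from a human grasp image (hypothetical format)."""
    if len(agent_joints) != len(prior_joints):
        raise ValueError("joint vectors must have the same length")
    if not agent_joints:
        raise ValueError("joint vectors must be non-empty")
    squared = [(a - p) ** 2 for a, p in zip(agent_joints, prior_joints)]
    return sum(squared) / len(squared)


def shaped_reward(task_reward: float,
                  agent_joints: Sequence[float],
                  prior_joints: Sequence[float],
                  prior_weight: float = 0.5) -> float:
    """Task reward minus a weighted pose-prior penalty (assumed form)."""
    return task_reward - prior_weight * pose_prior_penalty(agent_joints, prior_joints)


# The agent matches the prior exactly, so the task reward is unchanged.
matched = shaped_reward(1.0, [0.0, 0.0], [0.0, 0.0])

# The agent is 1 rad off on both joints, so the penalty is 0.5 * 1.0.
mismatched = shaped_reward(1.0, [1.0, 1.0], [0.0, 0.0], prior_weight=0.5)
```

With a term of this kind, the policy is still rewarded for completing the grasp but is nudged toward hand configurations that resemble how people hold the object.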
Findings
Outperforms existing methods that do not use a hand pose prior
Trains faster than approaches that rely on tele-operated demonstrations
Evaluated on 27 objects with a 30-DoF simulated robot hand
Abstract
Dexterous multi-fingered robotic hands have a formidable action space, yet their morphological similarity to the human hand holds immense potential to accelerate robot learning. We propose DexVIP, an approach to learn dexterous robotic grasping from human-object interactions present in in-the-wild YouTube videos. We do this by curating grasp images from human-object interaction videos and imposing a prior over the agent's hand pose when learning to grasp with deep reinforcement learning. A key advantage of our method is that the learned policy is able to leverage free-form in-the-wild visual data. As a result, it can easily scale to new objects, and it sidesteps the standard practice of collecting human demonstrations in a lab -- a much more expensive and indirect way to capture human expertise. Through experiments on 27 objects with a 30-DoF simulated robot hand, we demonstrate that…
Taxonomy
Topics: Hand Gesture Recognition Systems · Human Pose and Action Recognition · Robot Manipulation and Learning
