World Models for Learning Dexterous Hand-Object Interactions from Human Videos
Raktim Gautam Goswami, Amir Bar, David Fan, Tsung-Yen Yang, Gaoyue Zhou, Prashanth Krishnamurthy, Michael Rabbat, Farshad Khorrami, Yann LeCun

TL;DR
This paper introduces DexWM, a world model for dexterous hand-object interactions that predicts future states from fine-grained finger actions extracted from videos, enabling effective transfer to robotic manipulation tasks.
Contribution
DexWM is the first world model to incorporate finger keypoints and hand configuration consistency for modeling dexterous manipulation from human videos.
Findings
DexWM outperforms prior models in future-state prediction.
It achieves over 50% improvement in zero-shot transfer to robotic tasks.
It successfully models fine-grained dexterous interactions from video data.
Abstract
Modeling dexterous hand-object interactions is challenging as it requires understanding how subtle finger motions influence the environment through contact with objects. While recent world models address interaction modeling, they typically rely on coarse action spaces that fail to capture fine-grained dexterity. We, therefore, introduce DexWM, a Dexterous Interaction World Model that predicts future latent states of the environment conditioned on past states and dexterous actions. To overcome the scarcity of finely annotated dexterous datasets, DexWM represents actions using finger keypoints extracted from egocentric videos, enabling training on over 900 hours of human and non-dexterous robot data. Further, to accurately model dexterity, we find that predicting visual features alone is insufficient; therefore, we incorporate an auxiliary hand consistency loss that enforces accurate…
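The abstract describes a training objective with two parts: predicting future latent states, and an auxiliary hand consistency term. A minimal sketch of how such a combined loss could be assembled is given below. The abstract does not specify the form of either term, so the mean squared error, the weighting factor `hand_weight`, and all function and parameter names here are assumptions for illustration, not the authors' implementation.

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length, non-empty vectors."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    if len(pred) == 0:
        raise ValueError("inputs must be non-empty")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def world_model_loss(
    pred_latent: Sequence[float],
    target_latent: Sequence[float],
    pred_hand: Sequence[float],
    target_hand: Sequence[float],
    hand_weight: float = 1.0,  # hypothetical weighting, not from the paper
) -> float:
    """Combine a latent prediction term with an auxiliary hand term.

    Assumption: both terms are plain MSE. The latent vectors stand in for
    predicted and encoded future visual features; the hand vectors stand in
    for flattened finger keypoint coordinates.
    """
    latent_term = mse(pred_latent, target_latent)
    hand_term = mse(pred_hand, target_hand)
    return latent_term + hand_weight * hand_term


# Toy usage with made-up numbers.
loss = world_model_loss(
    pred_latent=[1.0, 2.0],
    target_latent=[1.0, 4.0],
    pred_hand=[0.0],
    target_hand=[1.0],
    hand_weight=0.5,
)
```

Setting `hand_weight` to zero in this sketch reduces the objective to visual feature prediction alone, which the abstract reports as insufficient for modeling dexterity.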
Taxonomy
Topics: Robot Manipulation and Learning · Human Pose and Action Recognition · Social Robot Interaction and HRI
