World Models for Learning Dexterous Hand-Object Interactions from Human Videos

Raktim Gautam Goswami; Amir Bar; David Fan; Tsung-Yen Yang; Gaoyue Zhou; Prashanth Krishnamurthy; Michael Rabbat; Farshad Khorrami; Yann LeCun

arXiv:2512.13644·cs.RO·March 18, 2026

Open Access · 1 dataset

TL;DR

This paper introduces DexWM, a world model for dexterous hand-object interactions that predicts future states from fine-grained finger actions extracted from videos, enabling effective transfer to robotic manipulation tasks.

Contribution

DexWM is the first world model to incorporate finger keypoints and hand configuration consistency for modeling dexterous manipulation from human videos.

Findings

01 DexWM outperforms prior models in future-state prediction.

02 It achieves over 50% improvement in zero-shot transfer to robotic tasks.

03 It successfully models fine-grained dexterous interactions from video data.

Abstract

Modeling dexterous hand-object interactions is challenging as it requires understanding how subtle finger motions influence the environment through contact with objects. While recent world models address interaction modeling, they typically rely on coarse action spaces that fail to capture fine-grained dexterity. We, therefore, introduce DexWM, a Dexterous Interaction World Model that predicts future latent states of the environment conditioned on past states and dexterous actions. To overcome the scarcity of finely annotated dexterous datasets, DexWM represents actions using finger keypoints extracted from egocentric videos, enabling training on over 900 hours of human and non-dexterous robot data. Further, to accurately model dexterity, we find that predicting visual features alone is insufficient; therefore, we incorporate an auxiliary hand consistency loss that enforces accurate…
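
The objective described in the abstract, a latent prediction term plus an auxiliary hand consistency term, can be illustrated roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the squared-error form of both terms, and the weight `lam` are hypothetical. The abstract is cut off before it says what the hand consistency loss compares, so the sketch simply assumes predicted versus observed finger keypoints.

```python
from typing import Sequence

Vector = Sequence[float]


def mse(pred: Vector, target: Vector) -> float:
    """Mean squared error between two equal-length, non-empty vectors."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and equal in length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def world_model_loss(
    pred_latent: Vector,
    target_latent: Vector,
    pred_keypoints: Vector,
    target_keypoints: Vector,
    lam: float = 1.0,  # hypothetical weight on the auxiliary term
) -> float:
    """Latent prediction error plus a weighted hand consistency error.

    Assumption: both terms are plain MSE. The paper may use a different
    distance, a different target for the hand term, or a different weighting.
    """
    latent_term = mse(pred_latent, target_latent)
    hand_term = mse(pred_keypoints, target_keypoints)
    return latent_term + lam * hand_term


# Usage: flattened latent features and flattened finger keypoint coordinates.
loss = world_model_loss(
    pred_latent=[1.0, 2.0],
    target_latent=[1.0, 4.0],
    pred_keypoints=[0.0],
    target_keypoints=[1.0],
    lam=0.5,
)
```

With `lam` set to zero the sketch reduces to predicting visual features alone, the setting the abstract reports as insufficient for modeling dexterity.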

Code & Models

Datasets

facebook/dexwm
dataset · 454 downloads

Taxonomy

Topics: Robot Manipulation and Learning · Human Pose and Action Recognition · Social Robot Interaction and HRI