EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video
Ryan Hoque, Peide Huang, David J. Yoon, Mouli Sivapurapu, Jian Zhang

TL;DR
EgoDex is a large-scale egocentric video dataset with paired 3D hand tracking, enabling improved imitation learning for dexterous manipulation across diverse household tasks.
Contribution
The paper introduces EgoDex, the largest dataset of its kind, with synchronized hand pose data and egocentric videos, facilitating advancements in manipulation imitation learning.
Findings
Hand trajectory prediction models can be trained effectively on EgoDex.
Benchmark results establish baseline performance for manipulation tasks.
The dataset's diversity enables generalization across multiple household activities.
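To make "hand trajectory prediction" concrete, the sketch below shows one common way such predictions could be scored: the mean Euclidean distance between predicted and ground-truth 3D keypoints over a trajectory. This is an illustrative assumption, not the benchmark's confirmed metric; the function name `mean_displacement_error` and the toy data are hypothetical.

```python
import math


def mean_displacement_error(predicted, ground_truth):
    """Average Euclidean distance between matching 3D keypoints.

    Both arguments are lists of timesteps; each timestep is a list of
    (x, y, z) keypoints. This is an assumed metric for illustration only.
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("trajectories must have the same number of timesteps")
    total, count = 0.0, 0
    for pred_frame, true_frame in zip(predicted, ground_truth):
        if len(pred_frame) != len(true_frame):
            raise ValueError("each timestep must have the same number of keypoints")
        for p, t in zip(pred_frame, true_frame):
            total += math.dist(p, t)  # Euclidean distance (Python 3.8+)
            count += 1
    if count == 0:
        raise ValueError("trajectories must contain at least one keypoint")
    return total / count


# Toy data (hypothetical): two timesteps, two keypoints each.
truth = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
         [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]]
pred = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
        [(0.0, 1.0, 0.3), (1.0, 1.4, 0.3)]]

error = mean_displacement_error(pred, truth)
print(f"mean displacement error: {error:.3f}")
```

In the toy data the first timestep matches exactly and the second is off by about 0.3 and 0.5 units, so the average over four keypoints is roughly 0.2. A lower value means the predicted hand motion stays closer to the recorded one.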
Abstract
Imitation learning for manipulation has a well-known data scarcity problem. Unlike natural language and 2D computer vision, there is no Internet-scale corpus of data for dexterous manipulation. One appealing option is egocentric human video, a passively scalable data source. However, existing large-scale datasets such as Ego4D do not have native hand pose annotations and do not focus on object manipulation. To this end, we use Apple Vision Pro to collect EgoDex: the largest and most diverse dataset of dexterous human manipulation to date. EgoDex has 829 hours of egocentric video with paired 3D hand and finger tracking data collected at the time of recording, where multiple calibrated cameras and on-device SLAM can be used to precisely track the pose of every joint of each hand. The dataset covers a wide range of diverse manipulation behaviors with everyday household objects in 194…
Peer Reviews
Decision: ICLR 2026 Poster
The strengths lie in the benchmark design and the extensive effort to collect a well-calibrated dataset. Specifically, they are:
- Big and diverse. Way larger than prior human/robot sets, with language, camera extrinsics, and dense dexterity labels, covering ~200 tasks and ~500 objects.
- Clean paired signals. Synced ego RGB + full 3D skeleton (wrists/fingertips/head/arms) at 30 Hz, which is much cleaner than post-hoc hand pose from internet videos.
- Benchmark-ready. Two well-defined tasks (tr
The weaknesses include:
- Benchmark scope is narrow. The evaluation focuses on human motion prediction only, without assessing the robot side (e.g., retargeting quality and policy performance). Human trajectory prediction is only an intermediate signal; good imitation error does not necessarily translate to task success. This disconnect has been noted before [1] on robot data, and it is all the more likely for human trajectory prediction, as retargeting might also amplify the error. A more
1. The paper is well-written and well-organized.
2. The proposed dataset is very useful.
3. The dataset statistics are extensively provided.
1. Some more pilot experiments should be provided, as mentioned in Section 6. It would be great to see that the downstream experiments are *actually* deployed, not just discussed, to prove the significance of the dataset.
2. More baseline or backbone networks should also be evaluated in the benchmark experiment.
3. Two recent works [1, 2] should be discussed and compared, especially [1].
*Refs*: [1] MEgoHand: Multimodal Egocentric Hand-Object Interaction Motion Generation. Zhou et al. [2
1. **Largest Scale and Rich Annotation**: EgoDex represents the largest-scale egocentric video dataset of human hand manipulation to date. The inclusion of precise, quantified camera parameters and 3D hand pose labels is a significant contribution, providing an invaluable resource for both video understanding and the development of embodied agents.
2. **Uniquely Naturalistic Data**: The data is collected from spontaneous, active human execution rather than unnatural, deliberately posed or slow-
1. **Limited Auxiliary Sensor Data**: The current collection is primarily focused on RGB video. The study could have been significantly enhanced by incorporating additional hardware information during data collection, such as depth maps or synchronized foreground/background segmentation masks. This supplementary data would facilitate lifting observations to 3D and enable subsequent researchers to confidently develop tasks involving procedural background randomization/generation.
2. **Insufficie
Taxonomy
Topics: Robot Manipulation and Learning · Human Pose and Action Recognition · Hand Gesture Recognition Systems
