EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video

Ryan Hoque; Peide Huang; David J. Yoon; Mouli Sivapurapu; Jian Zhang

arXiv:2505.11709 · cs.CV · March 10, 2026

Open Access · 3 Reviews

TL;DR

EgoDex is a large-scale egocentric video dataset with paired 3D hand tracking, enabling improved imitation learning for dexterous manipulation across diverse household tasks.

Contribution

The paper introduces EgoDex, described by the authors as the largest and most diverse dataset of dexterous human manipulation to date: 829 hours of egocentric video with synchronized 3D hand and finger pose data, intended to support imitation learning for manipulation.

Findings

01. Hand trajectory prediction models can be trained effectively on EgoDex.

02. Benchmark results establish baseline performance for manipulation tasks.

03. The dataset's diversity enables generalization across multiple household activities.
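The findings refer to hand trajectory prediction, but this page does not say how it is scored. As a minimal sketch, assuming a mean Euclidean distance between predicted and ground-truth 3D keypoints (a common choice, not confirmed here as the paper's metric), the error could be computed as follows. The function name `average_distance_error`, the `(T, K, 3)` array layout, and the toy values are all hypothetical.

```python
import numpy as np


def average_distance_error(pred, target):
    """Mean Euclidean distance between predicted and ground-truth 3D keypoints.

    Assumed layout (hypothetical, not taken from the paper):
    arrays of shape (T, K, 3) with T timesteps, K keypoints, xyz in meters.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    if pred.shape != target.shape:
        raise ValueError("pred and target must have the same shape")
    # Per-keypoint, per-timestep distance, then averaged over everything.
    return float(np.linalg.norm(pred - target, axis=-1).mean())


# Toy data: 30 timesteps, 2 keypoints, prediction offset by 3 cm along x.
target = np.zeros((30, 2, 3))
pred = target.copy()
pred[..., 0] += 0.03
error = average_distance_error(pred, target)
print(f"average distance error: {error:.3f} m")
```

A metric of this kind measures agreement with recorded human motion only; it says nothing on its own about whether a robot policy derived from the predictions completes the task.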

Abstract

Imitation learning for manipulation has a well-known data scarcity problem. Unlike natural language and 2D computer vision, there is no Internet-scale corpus of data for dexterous manipulation. One appealing option is egocentric human video, a passively scalable data source. However, existing large-scale datasets such as Ego4D do not have native hand pose annotations and do not focus on object manipulation. To this end, we use Apple Vision Pro to collect EgoDex: the largest and most diverse dataset of dexterous human manipulation to date. EgoDex has 829 hours of egocentric video with paired 3D hand and finger tracking data collected at the time of recording, where multiple calibrated cameras and on-device SLAM can be used to precisely track the pose of every joint of each hand. The dataset covers a wide range of diverse manipulation behaviors with everyday household objects in 194…

Peer Reviews

Decision: ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 5

Strengths

The strengths lie in the benchmark design and the extensive effort to collect a well-calibrated dataset. Specifically:
- Big and diverse. Way larger than prior human/robot sets, with language, camera extrinsics, and dense dexterity labels, covering ~200 tasks and ~500 objects.
- Clean paired signals. Synced ego RGB + full 3D skeleton (wrists/fingertips/head/arms) at 30 Hz, which is much cleaner than post-hoc hand pose from internet videos.
- Benchmark-ready. Two well-defined tasks (tr

Weaknesses

The weaknesses include:
- Benchmark scope is narrow. The evaluation focuses on human motion prediction only, without assessing the robot side (e.g., retargeting quality and policy performance). Human trajectory prediction is only an intermediate signal; low imitation error does not necessarily translate to task success. This disconnect has been noted before [1], and that was based on robot data; it applies all the more to human trajectory prediction, as the retargeting might also amplify the error. A more

Reviewer 02 · Rating 8 · Confidence 4

Strengths

1. The paper is well-written and well-organized.
2. The proposed dataset is very useful.
3. The dataset statistics are extensively provided.

Weaknesses

1. Some more pilot experiments should be provided, as mentioned in Section 6. It would be great to see that the downstream experiments are *actually* deployed, not just discussed, to prove the significance of the dataset.
2. More baseline or backbone networks should also be evaluated in the benchmark experiment.
3. Two recent works [1, 2] should be discussed and compared, especially [1].

*Refs*: [1] MEgoHand: Multimodal Egocentric Hand-Object Interaction Motion Generation. Zhou et al. [2

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. **Largest Scale and Rich Annotation**: EgoDex represents the largest-scale egocentric video dataset of human hand manipulation to date. The inclusion of precise, quantified camera parameters and 3D hand pose labels is a significant contribution, providing an invaluable resource for both video understanding and the development of embodied agents.
2. **Uniquely Naturalistic Data**: The data is collected from spontaneous, active human execution rather than unnatural, deliberately posed or slow-

Weaknesses

1. **Limited Auxiliary Sensor Data**: The current collection is primarily focused on RGB video. The study could have been significantly enhanced by incorporating additional hardware information during data collection, such as depth maps or synchronized foreground/background segmentation masks. This supplementary data would facilitate lifting observations to 3D and enable subsequent researchers to confidently develop tasks involving procedural background randomization/generation.
2. **Insufficie

Taxonomy

Topics: Robot Manipulation and Learning · Human Pose and Action Recognition · Hand Gesture Recognition Systems