EgoGrasp: World-Space Hand-Object Interaction Estimation from Egocentric Videos
Hongming Fu, Wenjia Wang, Xiaozhen Qiao, Rolandos Alexandros Potamias, Taku Komura, Shuo Yang, Zheng Liu, Bo Zhao

TL;DR
EgoGrasp reconstructs world-space hand-object interactions from egocentric videos. It supports open-vocabulary objects and addresses two obstacles that limit earlier methods: frequent occlusion and the high computational cost of per-instance optimization.
Contribution
EgoGrasp introduces a multi-stage framework that combines vision foundation models and diffusion models for accurate, open-vocabulary, and temporally consistent world-space hand-object interaction (W-HOI) estimation from egocentric videos.
Findings
Achieves state-of-the-art W-HOI reconstruction performance
Handles multiple objects and open vocabulary categories robustly
Maintains physical plausibility and temporal consistency
Abstract
We propose EgoGrasp, the first method to reconstruct world-space hand-object interactions (W-HOI) from dynamic egoview videos, supporting open-vocabulary objects. Accurate W-HOI reconstruction is critical for embodied intelligence yet remains challenging. Existing HOI methods are largely restricted to local camera coordinates or single frames, failing to capture global temporal dynamics. While some recent approaches attempt world-space hand estimation, they overlook object poses and HOI constraints. Moreover, previous HOI estimation methods either fail to handle open-set categories due to their reliance on object templates or employ differentiable rendering that requires per-instance optimization, resulting in prohibitive computational costs. Finally, frequent occlusions in egocentric videos severely degrade performance. To overcome these challenges, we propose a multi-stage framework:…
Taxonomy
Topics: Human Pose and Action Recognition · Human Motion and Animation · Robot Manipulation and Learning
