EgoGrasp: World-Space Hand-Object Interaction Estimation from Egocentric Videos

Hongming Fu; Wenjia Wang; Xiaozhen Qiao; Rolandos Alexandros Potamias; Taku Komura; Shuo Yang; Zheng Liu; Bo Zhao

arXiv:2601.01050 · cs.CV · March 17, 2026

Open Access

TL;DR

EgoGrasp reconstructs world-space hand-object interactions from egocentric videos. It supports open-vocabulary objects and is designed to cope with the frequent occlusions and the high per-instance optimization costs that limit earlier methods.

Contribution

It introduces a multi-stage framework that combines vision foundation models and diffusion models for accurate, open-vocabulary, and temporally consistent world-space hand-object interaction (W-HOI) estimation from egocentric videos.

Findings

01 Achieves state-of-the-art W-HOI reconstruction performance

02 Handles multiple objects and open-vocabulary categories robustly

03 Maintains physical plausibility and temporal consistency

Abstract

We propose EgoGrasp, the first method to reconstruct world-space hand-object interactions (W-HOI) from dynamic egoview videos, supporting open-vocabulary objects. Accurate W-HOI reconstruction is critical for embodied intelligence yet remains challenging. Existing HOI methods are largely restricted to local camera coordinates or single frames, failing to capture global temporal dynamics. While some recent approaches attempt world-space hand estimation, they overlook object poses and HOI constraints. Moreover, previous HOI estimation methods either fail to handle open-set categories due to their reliance on object templates or employ differentiable rendering that requires per-instance optimization, resulting in prohibitive computational costs. Finally, frequent occlusions in egocentric videos severely degrade performance. To overcome these challenges, we propose a multi-stage framework:…
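To make the abstract's distinction between local camera coordinates and world space concrete, here is a minimal sketch of a rigid camera-to-world transform. It is not the paper's pipeline: the function names, the yaw-only rotation, and the sample poses are all hypothetical and chosen only for illustration. It shows that a hand held at a fixed position in camera coordinates still traces a path in world space once the head-mounted camera moves.

```python
import math


def camera_to_world(point_cam, rotation, translation):
    """Map a 3D point from camera coordinates to world coordinates.

    rotation: 3x3 camera-to-world rotation as row-major nested lists.
    translation: camera position expressed in the world frame.
    """
    return [
        sum(rotation[i][j] * point_cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]


def yaw_rotation(angle_rad):
    """Rotation about the z axis (hypothetical helper for this sketch)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


# Hypothetical data: the hand keeps the same camera-space position
# while the camera translates and turns between two frames.
hand_cam = [0.5, 0.0, 0.0]
poses = [
    (yaw_rotation(0.0), [0.0, 0.0, 0.0]),
    (yaw_rotation(math.pi / 2), [1.0, 0.0, 0.0]),
]
hand_world = [camera_to_world(hand_cam, R, t) for R, t in poses]
```

In camera coordinates the hand appears static across both frames, whereas `hand_world` differs between them. A camera-space estimate alone therefore cannot describe global motion without an estimate of the camera pose.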

Taxonomy

Topics: Human Pose and Action Recognition · Human Motion and Animation · Robot Manipulation and Learning