TL;DR
EggHand is a foundation-model-based framework that forecasts future 3D hand poses from egocentric video by combining multimodal semantic reasoning with dynamic motion modeling, improving accuracy and robustness under severe ego-motion.
Contribution
It introduces a novel framework combining a Vision-Language-Action model with a video-text encoder for improved egocentric hand pose forecasting.
Findings
Sets a new state of the art in accuracy on the EgoExo4D dataset.
Remains robust under severe ego-motion conditions.
Enables controllable prediction using language prompts.
Abstract
Forecasting future 3D hand pose sequences from egocentric video is essential for understanding human intention and enabling embodied applications such as AR/VR assistance and human-robot interaction. However, this task remains a highly challenging problem because egocentric hand motion is driven by complex human intent, exhibits highly dexterous articulations, and is observed under drastic viewpoint shifts induced by ego-motion. In this work, we introduce EggHand, a foundation-model-based framework for egocentric hand pose forecasting that unifies multimodal semantic reasoning with dynamic motion modeling. Our approach couples an action decoder from a Vision-Language-Action (VLA) model, which captures the structured temporal dynamics of hand motion, with an egocentric video-text encoder that provides viewpoint-aware contextual information learned from large-scale first-person video.…
