EggHand: A Multimodal Foundation Model for Egocentric Hand Pose Forecasting

Jaeyoung Choi; Hyeondong Kim; Yujin Kim; Daehee Park

arXiv:2605.07642·cs.CV·May 11, 2026

TL;DR

EggHand is a multimodal foundation model that forecasts future 3D hand poses from egocentric video by combining semantic reasoning with dynamic motion modeling, improving accuracy on EgoExo4D and robustness under severe ego-motion.

Contribution

EggHand couples the action decoder of a Vision-Language-Action (VLA) model with an egocentric video-text encoder to forecast egocentric hand poses.

Findings

01 Sets new state-of-the-art accuracy on the EgoExo4D dataset.

02 Remains robust under severe ego-motion conditions.

03 Enables controllable prediction using language prompts.

Abstract

Forecasting future 3D hand pose sequences from egocentric video is essential for understanding human intention and enabling embodied applications such as AR/VR assistance and human-robot interaction. However, this task remains a highly challenging problem because egocentric hand motion is driven by complex human intent, exhibits highly dexterous articulations, and is observed under drastic viewpoint shifts induced by ego-motion. In this work, we introduce EggHand, a foundation-model-based framework for egocentric hand pose forecasting that unifies multimodal semantic reasoning with dynamic motion modeling. Our approach couples an action decoder from a Vision-Language-Action (VLA) model, which captures the structured temporal dynamics of hand motion, with an egocentric video-text encoder that provides viewpoint-aware contextual information learned from large-scale first-person video.…

Code & Models

Repositories

https://jyoun9.github.io/EggHand
