MEgoHand: Multimodal Egocentric Hand-Object Interaction Motion Generation

Bohan Zhou; Yi Zhan; Zhongbin Zhang; Zongqing Lu

arXiv:2505.16602·cs.CV·May 23, 2025

TL;DR

MEgoHand is a multimodal framework that generates realistic egocentric hand-object interactions from RGB, text, and initial hand pose, improving generalization and stability in AR/VR and robotics.

Contribution

It introduces a bi-level architecture combining vision-language models and flow-matching policies, along with a large curated dataset for robust hand-object motion synthesis.
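The summary above does not spell out how a flow-matching policy produces motion, so the following is only a generic, minimal sketch of flow-matching sampling, not the authors' implementation. It assumes a straight-line probability path from Gaussian noise to a single target and uses the closed-form velocity for that path in place of a learned network. The names `velocity`, `sample`, and `toy_wrist_offset` are hypothetical and chosen for illustration only.

```python
import random


def velocity(x, t, target):
    """Closed-form velocity of a straight-line path that ends at `target`.

    Assumption: this stands in for a learned network v_theta(x, t, condition);
    a real policy would predict the velocity from visual and textual context.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample(target, steps=10, seed=0):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with forward Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so (1 - t) is never zero
        v = velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical 3D wrist translation used purely as a toy target.
toy_wrist_offset = [0.10, -0.05, 0.30]
generated = sample(toy_wrist_offset)
```

With this idealized velocity field the Euler integration lands on the target after the final step; with a learned field, the same loop would instead transport noise samples toward the distribution of plausible motions.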

Findings

01. Achieves 86.9% reduction in wrist translation error

02. Reduces joint rotation error by 34.1%

03. Demonstrates strong generalization across diverse datasets

Abstract

Egocentric hand-object motion generation is crucial for immersive AR/VR and robotic imitation but remains challenging due to unstable viewpoints, self-occlusions, perspective distortion, and noisy ego-motion. Existing methods rely on predefined 3D object priors, which limits their generalization to novel objects. Meanwhile, recent multimodal approaches suffer from ambiguous generation from abstract textual cues, intricate pipelines for modeling 3D hand-object correlation, and compounding errors in open-loop prediction. We propose MEgoHand, a multimodal framework that synthesizes physically plausible hand-object interactions from egocentric RGB, text, and initial hand pose. MEgoHand introduces a bi-level architecture: a high-level "cerebrum" leverages a vision-language model (VLM) to infer motion priors from visual-textual context and a…

Videos

MEgoHand: Multimodal Egocentric Hand-Object Interaction Motion Generation · SlidesLive