TL;DR
MEgoHand is a multimodal framework that generates realistic egocentric hand-object interactions from RGB, text, and initial hand pose, improving generalization and stability in AR/VR and robotics.
Contribution
It introduces a bi-level architecture combining vision-language models and flow-matching policies, along with a large curated dataset for robust hand-object motion synthesis.
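To make the flow-matching half of that architecture concrete, here is a minimal sketch of how a flow-matching sampler turns noise into a pose by integrating a velocity field from t = 0 to t = 1. It is not MEgoHand's implementation: `toy_velocity`, `sample_flow`, the 3-dimensional "pose", and the step count are all hypothetical, and the closed-form velocity stands in for a learned network that, in a real system, would be conditioned on the motion priors from the vision-language model.

```python
import random


def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned velocity network v(x, t | context).

    For the straight-line path x(t) = (1 - t) * x0 + t * target, the velocity
    that carries x toward the target is (target - x) / (1 - t).
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_flow(x0, target, num_steps=10):
    """Integrate the velocity field with forward Euler steps from t=0 to t=1."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t never reaches 1.0, so the division above is safe
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(3)]  # start from Gaussian noise
target_pose = [0.1, -0.2, 0.3]  # made-up target, for illustration only
result = sample_flow(noise, target_pose)
```

Because the toy path is a straight line, the Euler steps land on the target after the final step. A learned velocity field is only approximate, so the number of integration steps becomes a trade-off between speed and accuracy.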
Findings
Reduces wrist translation error by 86.9%
Reduces joint rotation error by 34.1%
Demonstrates strong generalization across diverse datasets
Abstract
Egocentric hand-object motion generation is crucial for immersive AR/VR and robotic imitation but remains challenging due to unstable viewpoints, self-occlusions, perspective distortion, and noisy ego-motion. Existing methods rely on predefined 3D object priors, which limits their generalization to novel objects. Meanwhile, recent multimodal approaches suffer from ambiguous generation from abstract textual cues, intricate pipelines for modeling 3D hand-object correlation, and compounding errors in open-loop prediction. We propose MEgoHand, a multimodal framework that synthesizes physically plausible hand-object interactions from egocentric RGB, text, and initial hand pose. MEgoHand introduces a bi-level architecture: a high-level "cerebrum" leverages a vision-language model (VLM) to infer motion priors from visual-textual context and a…