Hand2World: Autoregressive Egocentric Interaction Generation via Free-Space Hand Gestures
Yuxi Wang, Wenqi Ouyang, Tianyi Wei, Yi Dong, Zhiqi Shen, Xingang Pan

TL;DR
Hand2World is a novel autoregressive framework that synthesizes photorealistic egocentric videos with hand-object interactions, addressing challenges like occlusion, camera motion, and arbitrary-length generation for augmented reality and embodied AI.
Contribution
The paper introduces a unified autoregressive model with occlusion-invariant hand conditioning and explicit camera geometry, enabling stable, long-term egocentric interaction video synthesis from a single scene image.
Findings
Significant improvements in perceptual quality and 3D consistency.
Supports camera control and long-horizon interactive generation.
Outperforms existing methods on three egocentric interaction benchmarks.
Abstract
Egocentric interactive world models are essential for augmented reality and embodied AI, where visual generation must respond to user input with low latency, geometric consistency, and long-term stability. We study egocentric interaction generation from a single scene image under free-space hand gestures, aiming to synthesize photorealistic videos in which hands enter the scene, interact with objects, and induce plausible world dynamics under head motion. This setting introduces fundamental challenges, including distribution shift between free-space gestures and contact-heavy training data, ambiguity between hand motion and camera motion in monocular views, and the need for arbitrary-length video generation. We present Hand2World, a unified autoregressive framework that addresses these challenges through occlusion-invariant hand conditioning based on projected 3D hand meshes, allowing…
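The abstract describes hand conditioning based on projected 3D hand meshes. As a rough illustration only, and not the paper's implementation, the sketch below projects camera-frame mesh vertices onto the image plane with a standard pinhole model. The function name `project_points` and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions introduced here for the example; the paper's actual rendering and conditioning pipeline is not specified in this summary.

```python
def project_points(vertices_cam, fx, fy, cx, cy):
    """Project 3D points (camera frame, z forward) to 2D pixel coordinates.

    Hypothetical helper for illustration: a plain pinhole projection,
    assumed here as a stand-in for how hand-mesh vertices could be mapped
    into the egocentric view. Not taken from the paper.
    """
    pixels = []
    for x, y, z in vertices_cam:
        if z <= 0:
            # Points at or behind the camera plane cannot be projected.
            raise ValueError("point must lie in front of the camera (z > 0)")
        u = fx * x / z + cx  # horizontal pixel coordinate
        v = fy * y / z + cy  # vertical pixel coordinate
        pixels.append((u, v))
    return pixels


# Example: two hypothetical hand-mesh vertices in metres.
hand_vertices = [(0.0, 0.0, 1.0), (0.5, -0.25, 2.0)]
projected = project_points(hand_vertices, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

Because the projection depends only on the 3D vertex positions and the camera parameters, it yields a hand signal even where objects would hide the hand in the image, which is one plausible reading of "occlusion-invariant" here.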
Taxonomy
Topics: Human Pose and Action Recognition · Human Motion and Animation · Interactive and Immersive Displays
