Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints
Chenyangguang Zhang, Botao Ye, Boqi Chen, Alexandros Delitzas, Fangjinhua Wang, Marc Pollefeys, Xi Wang

TL;DR
This paper introduces a novel framework for generating egocentric videos from a single reference frame using sparse 3D hand joints, effectively handling occlusions and ensuring 3D consistency.
Contribution
It proposes an occlusion-aware control module that leverages 3D hand joints and geometric embeddings, along with a large annotated dataset and a cross-embodiment benchmark.
Findings
Outperforms state-of-the-art methods in video quality and realism.
Achieves robust cross-embodiment generalization to robotic hands.
Effectively handles severe occlusions in egocentric video generation.
Abstract
Motion-controllable video generation is crucial for egocentric applications in virtual reality and embodied AI. However, existing methods often struggle to achieve 3D-consistent fine-grained hand articulation. By relying on 2D trajectories or implicit poses, they collapse 3D geometry into spatially ambiguous signals or over-rely on human-centric priors. Under severe egocentric occlusions, this causes motion inconsistencies and hallucinated artifacts, and prevents cross-embodiment generalization to robotic hands. To address these limitations, we propose a novel framework that generates egocentric videos from a single reference frame, leveraging sparse 3D hand joints as embodiment-agnostic control signals with clear semantic and geometric structures. We introduce an efficient control module that resolves occlusion ambiguities while fully preserving 3D information. Specifically,…
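To make the abstract's point about 2D trajectories being "spatially ambiguous" concrete, here is a minimal sketch, not taken from the paper, of how a pinhole projection discards depth. The function name `project_pinhole`, the intrinsics (`focal`, `cx`, `cy`), and the joint coordinates are all hypothetical values chosen for illustration.

```python
import math

def project_pinhole(point_3d, focal=500.0, cx=320.0, cy=240.0):
    """Project a 3D camera-space point (metres) to pixel coordinates.

    The intrinsics are assumed values for illustration only.
    """
    x, y, z = point_3d
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (focal * x / z + cx, focal * y / z + cy)

def same_pixel(uv_a, uv_b, tol=1e-6):
    """Return True if two 2D projections coincide within a tolerance."""
    return (math.isclose(uv_a[0], uv_b[0], abs_tol=tol)
            and math.isclose(uv_a[1], uv_b[1], abs_tol=tol))

# Two hypothetical joints on the same camera ray but at different depths.
near_joint = (0.02, 0.01, 0.30)
far_joint = (0.04, 0.02, 0.60)

uv_near = project_pinhole(near_joint)
uv_far = project_pinhole(far_joint)

# The 2D signal cannot tell the two joints apart; the 3D signal can.
ambiguous_in_2d = same_pixel(uv_near, uv_far)
distinct_in_3d = not math.isclose(near_joint[2], far_joint[2])
```

Both joints land on the same pixel even though one is twice as far from the camera, which is the kind of ambiguity a control signal that keeps the depth coordinate would avoid.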
