Egocentric World Model for Photorealistic Hand-Object Interaction Synthesis

Dayou Li; Lulin Liu; Bangya Liu; Shijie Zhou; Jiu Feng; Ziqi Lu; Minghui Zheng; Chenyu You; Zhiwen Fan

arXiv:2603.13615 · cs.CV · March 17, 2026

Open Access

TL;DR

EgoHOI is an egocentric world model that generates photorealistic hand-object interactions from user actions alone, using physics-informed embeddings to keep interactions physically plausible without future object states as input.

Contribution

The paper introduces EgoHOI, an egocentric Human-Object Interaction (HOI) world model that predicts realistic interactions from action signals using physics priors, surpassing prior methods that rely on known future object states.

Findings

01 EgoHOI outperforms strong baselines on the HOT3D dataset.

02 Physics-informed embeddings improve interaction realism.

03 Ablation studies confirm the importance of geometric and kinematic priors.

Abstract

To serve as a scalable data source for embodied AI, world models should act as true simulators that infer interaction dynamics strictly from user actions, rather than mere conditional video generators relying on privileged future object states. In this context, egocentric Human-Object Interaction (HOI) world models are critical for predicting physically grounded first-person rollouts. However, building such models is profoundly challenging due to rapid head motions, severe occlusions, and high-DoF hand articulations that abruptly alter contact topologies. Consequently, existing approaches often circumvent these physics challenges by resorting to conditional video generation with access to known future object trajectories. We introduce EgoHOI, an egocentric HOI world model that breaks away from this shortcut to simulate photorealistic, contact-consistent interactions from action signals…

Taxonomy

Topics: Human Pose and Action Recognition · Human Motion and Animation · Social Robot Interaction and HRI