EgoForge: Goal-Directed Egocentric World Simulator

Yifan Shen; Jiateng Liu; Xinzhuo Li; Yuanzhe Liu; Bingxuan Li; Houze Yang; Wenqi Jia; Yijiang Li; Tianjiao Yu; James Matthew Rehg; Xu Cao; Ismini Lourentzou

arXiv:2603.20169 · cs.CV · March 23, 2026

Open Access

TL;DR

EgoForge is an egocentric world simulation framework that generates coherent first-person video rollouts from minimal static inputs (a single egocentric image, a high-level instruction, and an optional exocentric view), using trajectory-guided diffusion refinement to improve goal alignment and scene consistency.

Contribution

The paper introduces EgoForge, a goal-directed egocentric world simulator that produces realistic video rollouts from a single egocentric image, a high-level instruction, and an optional auxiliary exocentric view, together with a diffusion-based refinement method that improves temporal and semantic coherence.

Findings

01. EgoForge outperforms baseline models in semantic alignment and scene stability.

02. The approach demonstrates robustness in real-world smart-glasses experiments.

03. Trajectory-guided diffusion improves goal completion and perceptual fidelity.

Abstract

Generative world models have shown promise for simulating dynamic environments, yet egocentric video remains challenging due to rapid viewpoint changes, frequent hand-object interactions, and goal-directed procedures whose evolution depends on latent human intent. Existing approaches either focus on hand-centric instructional synthesis with limited scene evolution, perform static view translation without modeling action dynamics, or rely on dense supervision such as camera trajectories, long video prefixes, or synchronized multicamera capture. In this work, we introduce EgoForge, an egocentric goal-directed world simulator that generates coherent, first-person video rollouts from minimal static inputs: a single egocentric image, a high-level instruction, and an optional auxiliary exocentric view. To improve intent alignment and temporal consistency, we propose VideoDiffusionNFT, a…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Human Pose and Action Recognition