Walk through Paintings: Egocentric World Models from Internet Priors

Anurag Bagchi; Zhipeng Bao; Homanga Bharadhwaj; Yu-Xiong Wang; Pavel Tokmakov; Martial Hebert

arXiv:2601.15284·cs.CV·January 22, 2026

Open Access

TL;DR

This paper introduces EgoWM, a versatile method that transforms pretrained video diffusion models into controllable, action-conditioned world models capable of realistic egocentric future predictions across various embodiments and tasks.

Contribution

The paper presents an approach for repurposing existing video models as action-conditioned world models without extensive retraining, enabling scalable and generalizable egocentric prediction.

Findings

01. EgoWM improves Structural Consistency Score by up to 80%.

02. Achieves up to six times lower inference latency.

03. Demonstrates robust generalization to unseen environments, including paintings.

Abstract

What if a video generation model could not only imagine a plausible future, but the correct one, accurately reflecting how the world changes with each action? We address this question by presenting the Egocentric World Model (EgoWM), a simple, architecture-agnostic method that transforms any pretrained video diffusion model into an action-conditioned world model, enabling controllable future prediction. Rather than training from scratch, we repurpose the rich world priors of Internet-scale video models and inject motor commands through lightweight conditioning layers. This allows the model to follow actions faithfully while preserving realism and strong generalization. Our approach scales naturally across embodiments and action spaces, ranging from 3-DoF mobile robots to 25-DoF humanoids, where predicting egocentric joint-angle-driven dynamics is substantially more challenging. The…
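To make the idea of injecting motor commands through lightweight conditioning layers more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the layer design, so the two-layer encoder, the additive injection into per-frame token features, and the zero-initialised output layer (a common choice in adapter-style fine-tuning) are all assumptions. The names `make_adapter`, `embed_action`, and `condition_frame` are hypothetical.

```python
import math
import random

def make_adapter(action_dim, hidden_dim, feat_dim, rng):
    """Create weights for a tiny two-layer action encoder (hypothetical helper)."""
    w_in = [[rng.gauss(0.0, 0.1) for _ in range(action_dim)]
            for _ in range(hidden_dim)]
    # Assumption: the output layer starts at zero, so the conditioning is a
    # no-op before training and the pretrained features pass through unchanged.
    w_out = [[0.0] * hidden_dim for _ in range(feat_dim)]
    return w_in, w_out

def embed_action(action, w_in, w_out):
    """Map one action vector (length A) to a feature-sized embedding (length D)."""
    hidden = [math.tanh(sum(a * w for a, w in zip(action, row))) for row in w_in]
    return [sum(h * w for h, w in zip(hidden, row)) for row in w_out]

def condition_frame(tokens, emb):
    """Add the frame's action embedding to every token feature of that frame."""
    return [[x + e for x, e in zip(tok, emb)] for tok in tokens]

rng = random.Random(0)
# Stand-in for one frame of pretrained-backbone features: 2 tokens, 4 channels.
frame_tokens = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(2)]

# The same adapter shape handles a 3-DoF and a 25-DoF action space;
# only the input width of the first layer changes.
for action_dim in (3, 25):
    w_in, w_out = make_adapter(action_dim, 8, 4, rng)
    action = [rng.uniform(-1.0, 1.0) for _ in range(action_dim)]
    conditioned = condition_frame(frame_tokens, embed_action(action, w_in, w_out))
    assert conditioned == frame_tokens  # zero-init leaves features untouched
```

In this sketch only the small adapter weights would be trained, while the backbone features come from the pretrained video model. This is one plausible reading of "lightweight conditioning layers", and the paper's actual design may differ.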

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Human Motion and Animation