Whole-Body Conditioned Egocentric Video Prediction

Yutong Bai; Danny Tran; Amir Bar; Yann LeCun; Trevor Darrell; Jitendra Malik

arXiv:2506.21552 · cs.CV · June 27, 2025

TL;DR

This paper introduces PEVA, a model that predicts egocentric video from past frames and a human action represented as relative 3D body pose, simulating how physical actions change what is seen from a first-person point of view.

Contribution

The paper presents an auto-regressive conditional diffusion transformer trained on Nymeria, a large-scale dataset of real-world egocentric video and body pose capture, together with a hierarchical evaluation protocol made up of increasingly challenging tasks.

Findings

01. Effective prediction of environment changes from human actions.

02. Hierarchical evaluation protocol reveals model's strengths and limitations.

03. Model demonstrates potential for embodied agent control tasks.

Abstract

We train models to Predict Ego-centric Video from human Actions (PEVA), given the past video and an action represented by the relative 3D body pose. By conditioning on kinematic pose trajectories, structured by the joint hierarchy of the body, our model learns to simulate how physical human actions shape the environment from a first-person point of view. We train an auto-regressive conditional diffusion transformer on Nymeria, a large-scale dataset of real-world egocentric video and body pose capture. We further design a hierarchical evaluation protocol with increasingly challenging tasks, enabling a comprehensive analysis of the model's embodied prediction and control abilities. Our work represents an initial attempt to tackle the challenges of modeling complex real-world environments and embodied agent behaviors with video prediction from the perspective of a human.
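
The abstract names two ideas that a small example can make concrete: an action expressed as body pose relative to the joint hierarchy, and auto-regressive prediction in which each generated frame is fed back as context for the next. The following is a minimal sketch under stated assumptions, not the authors' implementation. The functions `parent_relative_pose`, `rollout`, and `toy_predictor` are hypothetical names, the pose encoding (parent-relative joint offsets) is only one plausible reading of "relative 3D body pose", and the toy predictor is a stand-in for the conditional diffusion sampler, which is not reproduced here.

```python
from typing import Callable, List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Frame = List[float]   # stand-in for a video frame (or its latent)
Action = List[float]  # stand-in for a flattened body-pose action vector


def parent_relative_pose(joints: Sequence[Vec3], parents: Sequence[int]) -> List[Vec3]:
    """Express each joint as an offset from its parent in the kinematic tree.

    parents[i] is the index of joint i's parent, or -1 for the root.
    The root keeps its global position. This encoding is an assumption
    made for illustration only.
    """
    relative = []
    for i, (x, y, z) in enumerate(joints):
        p = parents[i]
        if p < 0:
            relative.append((x, y, z))
        else:
            px, py, pz = joints[p]
            relative.append((x - px, y - py, z - pz))
    return relative


def rollout(
    predict_next: Callable[[List[Frame], Action], Frame],
    context: Sequence[Frame],
    actions: Sequence[Action],
    window: int,
) -> List[Frame]:
    """Auto-regressive rollout: one new frame per action.

    Each prediction sees only the last `window` frames, which may include
    frames generated in earlier steps.
    """
    frames = list(context)
    generated = []
    for action in actions:
        past = frames[-window:]
        next_frame = predict_next(past, action)
        frames.append(next_frame)      # feed the prediction back as context
        generated.append(next_frame)
    return generated


def toy_predictor(past: List[Frame], action: Action) -> Frame:
    """Hypothetical stand-in for the learned model: shifts the last frame
    by the sum of the action vector."""
    shift = sum(action)
    return [value + shift for value in past[-1]]


# Three joints in a chain: root -> joint 1 -> joint 2.
pose = parent_relative_pose(
    [(0.0, 1.0, 0.0), (0.0, 1.5, 0.0), (0.5, 1.5, 0.0)],
    parents=[-1, 0, 1],
)
predicted = rollout(toy_predictor, [[0.0, 0.0]], [[1.0], [2.0]], window=1)
```

In this sketch the second generated frame depends on the first, so errors can accumulate over long rollouts; that is a general property of auto-regressive prediction rather than a claim about the paper's results.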

Taxonomy

Topics: Human Pose and Action Recognition

Methods: Diffusion