EgoExo-WM: Unlocking Exo Video for Ego World Models
Danny Tran, Roberto Martín-Martín, Kristen Grauman

TL;DR
This paper introduces a method to convert exocentric videos into egocentric data by extracting structured body poses, enabling the use of abundant in-the-wild videos to improve egocentric world models for prediction and planning.
Contribution
The authors propose a novel approach to transform exocentric videos into egocentric representations using human kinematics priors, enhancing egocentric world model training with in-the-wild data.
Findings
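Training whole-body action-conditioned egocentric world models with the converted data significantly improves prediction quality.
The improved world models also yield better downstream planning performance.
The conversion makes in-the-wild exocentric video usable for training egocentric world models.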
Abstract
Egocentric world models present a promising direction for enabling agents to predict and plan, but their performance is constrained by the limited availability of egocentric training data and its inherent partial observability of humans' physical actions. In contrast, exocentric video is abundant and reveals body poses well, but lacks direct alignment with an agent's action space -- and is not egocentric. We propose a method to bridge this gap by extracting structured body pose from exocentric video as a representation of action and transforming the exocentric video to egocentric video, informed by a human kinematics prior. This process unlocks the integration of in-the-wild exocentric data for egocentric world model training. We show that training whole-body action-conditioned egocentric world models with our converted data significantly improves both prediction quality and downstream…
