GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control

Mariam Hassan; Sebastian Stapf; Ahmad Rahimi; Pedro M B Rezende; Yasaman Haghighi; David Brüggemann; Isinsu Katircioglu; Lin Zhang; Xiaoran Chen; Suman Saha; Marco Cannici; Elie Aljalbout; Botao Ye; Xi Wang; Aram Davtyan; Mathieu Salzmann; Davide Scaramuzza; Marc Pollefeys; Paolo Favaro; Alexandre Alahi

arXiv:2412.11198 · cs.CV · December 17, 2024

TL;DR

GEM is a multimodal ego-vision world model that predicts future frames with precise control over object dynamics, ego-motion, and human poses, enabling diverse and temporally consistent long-horizon scene generation.

Contribution

We introduce GEM, a generalizable multimodal world model that uses autoregressive noise schedules for stable long-horizon generation, together with a new metric for controllability, Control of Object Manipulation (COM).

Findings

01

GEM achieves high-quality, controllable long-horizon scene generation.

02

The model outperforms baselines in diversity and temporal consistency.

03

Our dataset and evaluation framework advance multimodal scene understanding.

Abstract

We present GEM, a Generalizable Ego-vision Multimodal world model that predicts future frames using a reference frame, sparse features, human poses, and ego-trajectories. Hence, our model has precise control over object dynamics, ego-agent motion, and human poses. GEM generates paired RGB and depth outputs for richer spatial understanding. We introduce autoregressive noise schedules to enable stable long-horizon generations. Our dataset comprises 4000+ hours of multimodal data across domains such as autonomous driving, egocentric human activities, and drone flights. We use pseudo-labels to obtain depth maps, ego-trajectories, and human poses. We use a comprehensive evaluation framework, including a new Control of Object Manipulation (COM) metric, to assess controllability. Experiments show GEM excels at generating diverse, controllable scenarios and temporal consistency over long…

Code & Models

Repositories

vita-epfl/gem
PyTorch · Official

Taxonomy

Topics: Human Motion and Animation · Image Processing and 3D Reconstruction · Simulation and Modeling Applications