GenEx: Generating an Explorable World

Taiming Lu; Tianmin Shu; Junfei Xiao; Luoxin Ye; Jiahao Wang; Cheng Peng; Chen Wei; Daniel Khashabi; Rama Chellappa; Alan Yuille; Jieneng Chen

arXiv:2412.09624·cs.CV·January 22, 2025

TL;DR

GenEx is a system that generates detailed, 3D-consistent virtual environments from minimal input, enabling AI agents to explore, navigate, and interact within imaginative worlds for advanced embodied AI research.

Contribution

Introducing GenEx, a novel generative model that creates continuous, 3D-consistent environments from limited data, facilitating complex exploration and navigation tasks for embodied AI agents.

Findings

01

High-quality 3D environment generation from minimal input

02

Robust loop consistency over long trajectories

03

Effective 3D mapping and exploration capabilities

Abstract

Understanding, navigating, and exploring the 3D physical real world has long been a central challenge in the development of artificial intelligence. In this work, we take a step toward this goal by introducing GenEx, a system capable of planning complex embodied world exploration, guided by its generative imagination that forms priors (expectations) about the surrounding environments. GenEx generates an entire 3D-consistent imaginative environment from as little as a single RGB image, bringing it to life through panoramic video streams. Leveraging scalable 3D world data curated from Unreal Engine, our generative model is grounded in the physical world. It captures a continuous 360-degree environment with little effort, offering a boundless landscape for AI agents to explore and interact with. GenEx achieves high-quality world generation, robust loop consistency over long trajectories,…

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 5 · Confidence 3

Strengths

The concept is intriguing and the explanation is clear and straightforward.

Weaknesses

1. There are some errors in the mathematical formulations presented, particularly in equations (3) and (4), which are confusing. 2. Although SCL is highlighted as a contribution, its effectiveness is not demonstrated in the experimental results. 3. The use of latent diffusion with temporal attention is not a novel architecture. 4. The real-world dynamics of vehicles do not allow for pure rotation, which the paper seems to overlook. 5. Table 3 presents an unfair comparison.

Reviewer 02 · Rating 8 · Confidence 3

Strengths

+ The idea of building a Generative World Explorer is interesting, and I think it will be useful for the development of embodied AI research. + It is practical to apply the proposed Genex to the embodied decision-making process.

Weaknesses

- There is a gap between the training data (synthesized with Unity) and the test data (captured from Google Street View): in terms of the degrees of freedom of the observation perspectives, Google Street View seems more limited than Unity. However, the gap between training and test data may not always be "bad", because such a gap may demonstrate more generalizability. - In the following sentence “An embodied agent is inherently a POMDP agent (Kaelbling et al., 1998): instead of full observation, the agent has only

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. Leveraging generative models to complete partial observations into a full “world” understanding is a reasonable way to utilize the priors learned from the data. 2. For the panorama representation, they design spherical-consistent learning in the training process to improve the consistency of the panorama image. From their results, the panorama indeed shows better consistency and leads to a better representation of the scene. 3. The authors conduct extensive experiments and create a benchma

Weaknesses

1. In this work, the authors actually construct an explicit representation of “the imagination prior” for decision making. However, in the benchmark setting, most questions seem related only to a specific case: for a single agent, simply avoiding some unseen cars, and for multiple agents, making the other two agents avoid a collision. The task setting does not seem challenging or common enough to demonstrate the usefulness of such imagination ability. Also it’s hard to see the real performance

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Multimodal Machine Learning Applications