From Single Images to Motion Policies via Video-Generation Environment Representations

Weiming Zhi; Ziyong Ma; Tianyi Zhang; Matthew Johnson-Roberson

arXiv:2505.19306 · cs.RO · May 27, 2025

Open Access

TL;DR

This paper introduces VGER, a novel framework that generates environment representations from a single RGB image using video generation models, enabling collision-free motion planning for robots.

Contribution

The paper presents VGER, a new method that leverages video generation and 3D modeling to create environment representations from a single image for motion planning.

Findings

01. VGER produces smooth, geometry-aware robot motions from a single image.

02. The framework effectively handles diverse indoor and outdoor environments.

03. VGER outperforms existing monocular depth-based methods in motion generation.

Abstract

Autonomous robots typically need to construct representations of their surroundings and adapt their motions to the geometry of their environment. Here, we tackle the problem of constructing a policy model for collision-free motion generation, consistent with the environment, from a single input RGB image. Extracting 3D structures from a single image often involves monocular depth estimation. Developments in depth estimation have given rise to large pre-trained models such as DepthAnything. However, using outputs of these models for downstream motion generation is challenging due to frustum-shaped errors that arise. Instead, we propose a framework known as Video-Generation Environment Representation (VGER), which leverages the advances of large-scale video generation models to generate a moving camera video conditioned on the input image. Frames of this video, which form a multiview…
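The abstract does not define "frustum-shaped errors", so the following is only one plausible reading, sketched under stated assumptions. With a standard pinhole camera model, a pixel and its estimated depth are lifted to a 3D point along the ray through the camera centre. An error in the depth estimate therefore slides the point along that ray, and because the rays fan out from the camera, such errors spread across a frustum-shaped region. The function name `backproject` and the intrinsics below (`fx`, `fy`, `cx`, `cy`) are hypothetical values chosen for illustration and are not taken from the paper.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with a depth value to a 3D point in the camera frame.

    Standard pinhole model: (fx, fy) are focal lengths in pixels and
    (cx, cy) is the principal point. All values here are illustrative.
    """
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)


# Hypothetical intrinsics for a 640x480 image (assumed, not from the paper).
FX, FY, CX, CY = 500.0, 500.0, 320.0, 240.0

# The same pixel lifted with a "true" depth and with an overestimated depth.
true_point = backproject(600, 400, 2.0, FX, FY, CX, CY)
estimated_point = backproject(600, 400, 2.5, FX, FY, CX, CY)

# The depth error scales all three coordinates by the same factor, so the
# estimated point lies on the camera ray through the true point.
scale = estimated_point[2] / true_point[2]
on_same_ray = all(
    abs(e - scale * t) < 1e-9 for e, t in zip(estimated_point, true_point)
)
print("true point:     ", true_point)
print("estimated point:", estimated_point)
print("lies on the same camera ray:", on_same_ray)
```

In this sketch the lateral displacement caused by a depth error grows with the pixel's distance from the principal point, which is one reason point clouds lifted from a single depth map can be awkward inputs for collision checking.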

Taxonomy

Topics: Advanced Vision and Imaging

Methods: Sparse Evolutionary Training