Generative Image as Action Models

Mohit Shridhar; Yat Long Lo; Stephen James

arXiv:2407.07875·cs.RO·October 10, 2024

TL;DR

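GENIMA fine-tunes Stable Diffusion to draw joint-action targets on RGB images, which a controller then maps to joint positions. The resulting policies are more robust to scene perturbations and generalize better to novel objects than prior visuomotor approaches, without relying on depth, keypoints, or motion planners.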

Contribution

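This work introduces GENIMA, a behavior-cloning agent that lifts actions into image space: a fine-tuned diffusion model draws joint-action targets on RGB images, and a controller maps those visual targets to a sequence of joint positions. It outperforms state-of-the-art visuomotor approaches, especially in robustness to scene perturbations and generalization to novel objects.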

Findings

01

Outperforms state-of-the-art visuomotor approaches in robustness and generalization.

02

Achieves competitive performance with 3D agents without depth or motion priors.

03

Evaluated on 25 simulated RLBench tasks and 9 real-world manipulation tasks.

Abstract

Image-generation diffusion models have been fine-tuned to unlock new capabilities such as image-editing and novel view synthesis. Can we similarly unlock image-generation models for visuomotor control? We present GENIMA, a behavior-cloning agent that fine-tunes Stable Diffusion to 'draw joint-actions' as targets on RGB images. These images are fed into a controller that maps the visual targets into a sequence of joint-positions. We study GENIMA on 25 RLBench and 9 real-world manipulation tasks. We find that, by lifting actions into image-space, internet pre-trained diffusion models can generate policies that outperform state-of-the-art visuomotor approaches, especially in robustness to scene perturbations and generalizing to novel objects. Our method is also competitive with 3D agents, despite lacking priors such as depth, keypoints, or motion-planners.
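To make "lifting actions into image-space" concrete, the sketch below renders joint targets as colored markers on an RGB frame. It is a minimal illustration under stated assumptions, not the authors' implementation: the pinhole camera matrix `K`, the camera-frame target positions, the disc-shaped markers, and the helper names `project_points` and `draw_targets` are all hypothetical. In GENIMA the target image is produced by the fine-tuned diffusion model; a rendering like this one only illustrates what such a target image could look like.

```python
import numpy as np

def project_points(points_cam, K):
    """Pinhole projection of (N, 3) camera-frame points to (N, 2) pixel coords."""
    uvw = points_cam @ K.T              # each row becomes K @ p
    return uvw[:, :2] / uvw[:, 2:3]     # divide by depth

def draw_targets(image, pixels, colors, radius=4):
    """Return a copy of `image` with one filled disc per target pixel."""
    out = image.copy()
    h, w = out.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]         # row (y) and column (x) index grids
    for (u, v), color in zip(pixels, colors):
        mask = (xx - u) ** 2 + (yy - v) ** 2 <= radius ** 2
        out[mask] = color
    return out

# Hypothetical camera intrinsics and joint targets (assumptions for illustration).
K = np.array([[100.0, 0.0, 64.0],
              [0.0, 100.0, 64.0],
              [0.0, 0.0, 1.0]])
joint_targets_cam = np.array([[0.0, 0.0, 1.0],
                              [0.2, -0.1, 1.0]])
colors = np.array([[255, 0, 0],         # one color per joint
                   [0, 255, 0]], dtype=np.uint8)

rgb = np.zeros((128, 128, 3), dtype=np.uint8)   # stand-in for a camera frame
pixels = project_points(joint_targets_cam, K)
target_image = draw_targets(rgb, pixels, colors)
```

Assigning each joint its own color is one way a downstream controller could tell the targets apart; the controller that maps such images to joint positions is not sketched here.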

Peer Reviews

Decision·CoRL 2024

Reviewer 01 · Rating 3 · Confidence 3

Reviewer 02 · Rating 3 · Confidence 5

Reviewer 03 · Rating 3 · Confidence 3

Code & Models

Repositories

MohitShridhar/genima
PyTorch · Official

Taxonomy

Topics: Narrative Theory and Analysis · Visual Culture and Art Theory

Methods: Diffusion