Generative Image as Action Models
Mohit Shridhar, Yat Long Lo, Stephen James

TL;DR
GENIMA fine-tunes Stable Diffusion to draw joint-action targets on RGB images, which a controller then maps to a sequence of joint positions. On manipulation tasks it is more robust and generalizes better than prior visuomotor approaches, without relying on depth, keypoints, or motion planners.
Contribution
This work introduces GENIMA, a novel approach that uses diffusion models for visuomotor control by translating actions into visual targets, outperforming existing methods in robustness and generalization.
Findings
Outperforms state-of-the-art visuomotor approaches, especially in robustness to scene perturbations and generalization to novel objects.
Achieves competitive performance with 3D agents despite lacking priors such as depth, keypoints, or motion planners.
Evaluated on 25 RLBench tasks and 9 real-world manipulation tasks.
Abstract
Image-generation diffusion models have been fine-tuned to unlock new capabilities such as image-editing and novel view synthesis. Can we similarly unlock image-generation models for visuomotor control? We present GENIMA, a behavior-cloning agent that fine-tunes Stable Diffusion to 'draw joint-actions' as targets on RGB images. These images are fed into a controller that maps the visual targets into a sequence of joint-positions. We study GENIMA on 25 RLBench and 9 real-world manipulation tasks. We find that, by lifting actions into image-space, internet pre-trained diffusion models can generate policies that outperform state-of-the-art visuomotor approaches, especially in robustness to scene perturbations and generalizing to novel objects. Our method is also competitive with 3D agents, despite lacking priors such as depth, keypoints, or motion-planners.
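To make the idea of "drawing joint-actions as targets on RGB images" concrete, here is a minimal sketch of how such a target image might be constructed from a demonstration frame. It is an illustration under stated assumptions, not GENIMA's actual rendering code: the pinhole camera model, the filled-disc markers, the per-joint colors, and the names `project_point` and `draw_targets` are all hypothetical.

```python
from typing import List, Sequence, Tuple

Color = Tuple[int, int, int]
Image = List[List[Color]]  # image[row][col] -> (R, G, B)


def project_point(point_cam: Sequence[float],
                  fx: float, fy: float, cx: float, cy: float) -> Tuple[int, int]:
    """Project a 3D point in the camera frame to pixel (u, v).

    Assumes an ideal pinhole camera with no lens distortion.
    """
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    u = fx * x / z + cx
    v = fy * y / z + cy
    return int(round(u)), int(round(v))


def draw_targets(image: Image,
                 joints_cam: Sequence[Sequence[float]],
                 colors: Sequence[Color],
                 fx: float, fy: float, cx: float, cy: float,
                 radius: int = 2) -> Image:
    """Return a copy of `image` with one colored disc per joint target.

    Each joint gets its own color so a downstream controller could, in
    principle, tell the targets apart (an assumption of this sketch).
    """
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]  # leave the input frame untouched
    for point, color in zip(joints_cam, colors):
        u, v = project_point(point, fx, fy, cx, cy)
        # Clip the disc's bounding box to the image borders.
        for row in range(max(0, v - radius), min(height, v + radius + 1)):
            for col in range(max(0, u - radius), min(width, u + radius + 1)):
                if (col - u) ** 2 + (row - v) ** 2 <= radius ** 2:
                    out[row][col] = color
    return out


# Hypothetical usage: a blank 64x64 frame and two joint targets.
blank = [[(0, 0, 0) for _ in range(64)] for _ in range(64)]
targets = [(0.0, 0.0, 1.0), (0.2, -0.1, 1.0)]
palette = [(255, 0, 0), (0, 255, 0)]
annotated = draw_targets(blank, targets, palette, fx=50, fy=50, cx=32, cy=32)
```

In a behavior-cloning setup, pairs of the original frame and an annotated frame like this could serve as supervision for fine-tuning the image model, with a separate controller learning to turn the drawn targets into joint positions.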
Peer Reviews
Decision·CoRL 2024
Taxonomy
Topics: Narrative Theory and Analysis · Visual Culture and Art Theory
Methods: Diffusion
