Can Pre-Trained Text-to-Image Models Generate Visual Goals for   Reinforcement Learning?

Jialu Gao; Kaizhe Hu; Guowei Xu; Huazhe Xu

arXiv:2307.07837·cs.RO·July 18, 2023·1 cites

Can Pre-Trained Text-to-Image Models Generate Visual Goals for Reinforcement Learning?

Jialu Gao, Kaizhe Hu, Guowei Xu, Huazhe Xu

PDF

Open Access 1 Video

TL;DR

This paper introduces LfVoid, a novel method that uses pre-trained text-to-image models and image editing to generate goal images for reinforcement learning, enabling robots to learn from natural language instructions without domain-specific training.

Contribution

The paper presents LfVoid, a new approach that leverages pre-trained generative models and image editing to guide robot learning from language instructions without in-domain training.

Findings

01

LfVoid successfully guides robots in simulated tasks.

02

The method works in real-world scenarios.

03

It requires no in-domain training data.

Abstract

Pre-trained text-to-image generative models can produce diverse, semantically rich, and realistic images from natural language descriptions. Compared with language, images usually convey information with more details and less ambiguity. In this study, we propose Learning from the Void (LfVoid), a method that leverages the power of pre-trained text-to-image models and advanced image editing techniques to guide robot learning. Given natural language instructions, LfVoid can edit the original observations to obtain goal images, such as "wiping" a stain off a table. Subsequently, LfVoid trains an ensembled goal discriminator on the generated image to provide reward signals for a reinforcement learning agent, guiding it to achieve the goal. The ability of LfVoid to learn with zero in-domain training on expert demonstrations or true goal observations (the void) is attributed to the…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

Can Pre-Trained Text-to-Image Models Generate Visual Goals for Reinforcement Learning?· slideslive

Taxonomy

TopicsMultimodal Machine Learning Applications