Can Pre-Trained Text-to-Image Models Generate Visual Goals for Reinforcement Learning?
Jialu Gao, Kaizhe Hu, Guowei Xu, Huazhe Xu

TL;DR
This paper introduces LfVoid, a novel method that uses pre-trained text-to-image models and image editing to generate goal images for reinforcement learning, enabling robots to learn from natural language instructions without domain-specific training.
Contribution
The paper presents LfVoid, a new approach that leverages pre-trained generative models and image editing to guide robot learning from language instructions without in-domain training.
Findings
LfVoid successfully guides robots in simulated tasks.
The method works in real-world scenarios.
It requires no in-domain training data.
Abstract
Pre-trained text-to-image generative models can produce diverse, semantically rich, and realistic images from natural language descriptions. Compared with language, images usually convey information with more details and less ambiguity. In this study, we propose Learning from the Void (LfVoid), a method that leverages the power of pre-trained text-to-image models and advanced image editing techniques to guide robot learning. Given natural language instructions, LfVoid can edit the original observations to obtain goal images, such as "wiping" a stain off a table. Subsequently, LfVoid trains an ensembled goal discriminator on the generated image to provide reward signals for a reinforcement learning agent, guiding it to achieve the goal. The ability of LfVoid to learn with zero in-domain training on expert demonstrations or true goal observations (the void) is attributed to the…
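The reward mechanism described in the abstract, an ensembled goal discriminator whose output serves as the reward signal, can be illustrated with a small sketch. This is not the authors' implementation. It assumes that images have already been mapped to fixed-length feature vectors by some encoder, and it swaps in plain logistic regression for whatever discriminator network the paper uses. The names `train_member`, `train_ensemble`, and `ensemble_reward` are hypothetical, as is the use of bootstrap resampling to diversify the ensemble.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(member, features):
    """Probability that `features` depicts the goal, under one discriminator."""
    w, b = member
    return sigmoid(sum(wi * xi for wi, xi in zip(w, features)) + b)


def train_member(goals, negatives, rng, epochs=100, lr=0.5):
    """Fit one logistic-regression discriminator with SGD.

    Assumption: `goals` are feature vectors of generated goal images (label 1)
    and `negatives` are feature vectors of ordinary observations (label 0).
    """
    data = [(x, 1.0) for x in goals] + [(x, 0.0) for x in negatives]
    dim = len(data[0][0])
    w = [rng.gauss(0.0, 0.01) for _ in range(dim)]
    b = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            err = score((w, b), x) - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def train_ensemble(goals, negatives, n_members=5, seed=0):
    """Train several discriminators, each on a bootstrap resample of the data."""
    rng = random.Random(seed)
    members = []
    for _ in range(n_members):
        g = [rng.choice(goals) for _ in goals]
        n = [rng.choice(negatives) for _ in negatives]
        members.append(train_member(g, n, rng))
    return members


def ensemble_reward(members, features):
    """Reward for an observation: the mean goal probability across the ensemble."""
    return sum(score(m, features) for m in members) / len(members)
```

In a reinforcement learning loop, `ensemble_reward` would be called on the features of each new observation. Averaging over several members is one simple way to damp the effect of any single discriminator that the agent learns to exploit.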
Taxonomy
Topics: Multimodal Machine Learning Applications
