Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control

Gunshi Gupta; Karmesh Yadav; Yarin Gal; Dhruv Batra; Zsolt Kira; Cong Lu; Tim G. J. Rudner

arXiv:2405.05852·cs.CV·May 12, 2024

TL;DR

This paper demonstrates that representations from pre-trained text-to-image diffusion models can be used to improve embodied AI control tasks, providing fine-grained scene understanding and enabling better generalization in complex environments.

Contribution

It introduces Stable Control Representations derived from text-to-image diffusion models, enhancing embodied AI control policies beyond existing contrastive methods.

Findings

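1. Policies using Stable Control Representations perform competitively with state-of-the-art representation learning approaches across a range of simulated control tasks.
2. Policies learned with Stable Control Representations achieve state-of-the-art results on the OVMM open-vocabulary navigation benchmark.
3. Stable Control Representations enable learning of fine-grained, generalizable control policies in complex environments.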

Abstract

Embodied AI agents require a fine-grained understanding of the physical world mediated through visual and language inputs. Such capabilities are difficult to learn solely from task-specific data. This has led to the emergence of pre-trained vision-language models as a tool for transferring representations learned from internet-scale data to downstream tasks and new domains. However, commonly used contrastively trained representations such as in CLIP have been shown to fail at enabling embodied agents to gain a sufficiently fine-grained scene understanding -- a capability vital for control. To address this shortcoming, we consider representations from pre-trained text-to-image diffusion models, which are explicitly optimized to generate images from text prompts and as such, contain text-conditioned representations that reflect highly fine-grained visuo-spatial information. Using…

Code & Models

Repositories

ykarmesh/stable-control-representations
PyTorch · Official

Models

ykarmesh/stable-control-representations
Hugging Face · model · ♡ 1

Videos

Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control · SlidesLive

Taxonomy

Topics: Model Reduction and Neural Networks · Machine Learning in Healthcare · Neural Networks and Applications

Methods: Diffusion · Contrastive Language-Image Pre-training