Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control
Gunshi Gupta, Karmesh Yadav, Yarin Gal, Dhruv Batra, Zsolt Kira, Cong Lu, Tim G. J. Rudner

TL;DR
This paper demonstrates that representations from pre-trained text-to-image diffusion models can improve policies for embodied AI control tasks by providing fine-grained scene understanding and enabling better generalization in complex environments.
Contribution
It introduces Stable Control Representations derived from text-to-image diffusion models, enhancing embodied AI control policies beyond existing contrastive methods.
Findings
Policies using Stable Control Representations perform competitively across various tasks.
Achieves state-of-the-art results on the OVMM open-vocabulary navigation benchmark.
Enables learning of fine-grained, generalizable control policies in complex environments.
Abstract
Embodied AI agents require a fine-grained understanding of the physical world mediated through visual and language inputs. Such capabilities are difficult to learn solely from task-specific data. This has led to the emergence of pre-trained vision-language models as a tool for transferring representations learned from internet-scale data to downstream tasks and new domains. However, commonly used contrastively trained representations such as in CLIP have been shown to fail at enabling embodied agents to gain a sufficiently fine-grained scene understanding -- a capability vital for control. To address this shortcoming, we consider representations from pre-trained text-to-image diffusion models, which are explicitly optimized to generate images from text prompts and as such, contain text-conditioned representations that reflect highly fine-grained visuo-spatial information. Using…
Taxonomy
Topics: Model Reduction and Neural Networks · Machine Learning in Healthcare · Neural Networks and Applications
Methods: Diffusion · Contrastive Language-Image Pre-training
