Foundation Models for Semantic Novelty in Reinforcement Learning
Tarun Gupta, Peter Karkus, Tong Che, Danfei Xu, Marco Pavone

TL;DR
This paper introduces an intrinsic reward for reinforcement learning computed from the embeddings of a pre-trained foundation model such as CLIP. The reward drives exploration toward semantically meaningful states without any fine-tuning on the target task, and it outperforms state-of-the-art methods in sparse-reward, procedurally generated environments.
Contribution
The paper defines an intrinsic reward for RL directly from pre-trained foundation-model embeddings (e.g., CLIP), with no fine-tuning or learning on the target RL task.
Findings
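CLIP-based intrinsic rewards drive exploration toward semantically meaningful states in sparse-reward environments.
The method outperforms state-of-the-art exploration methods in challenging sparse-reward, procedurally generated environments.
Semantic knowledge encoded in pre-trained CLIP embeddings guides exploration without task-specific fine-tuning.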
Abstract
Effectively exploring the environment is a key challenge in reinforcement learning (RL). We address this challenge by defining a novel intrinsic reward based on a foundation model, such as contrastive language image pretraining (CLIP), which can encode a wealth of domain-independent semantic visual-language knowledge about the world. Specifically, our intrinsic reward is defined based on pre-trained CLIP embeddings without any fine-tuning or learning on the target RL task. We demonstrate that CLIP-based intrinsic rewards can drive exploration towards semantically meaningful states and outperform state-of-the-art methods in challenging sparse-reward procedurally-generated environments.
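The abstract does not give the exact form of the intrinsic reward, so the following is only a minimal sketch of one plausible embedding-based novelty bonus, not the authors' formulation. It assumes an episodic memory of embeddings and rewards a state by one minus its highest cosine similarity to anything already seen in the episode. The names `EmbeddingNoveltyReward` and `toy_encoder` are hypothetical; in practice `encode` would wrap a frozen, pre-trained CLIP image encoder, which is replaced here by an identity function so the example runs without model weights.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class EmbeddingNoveltyReward:
    """Hypothetical episodic novelty bonus over frozen embeddings.

    ASSUMPTION: the reward form (1 - max cosine similarity to episodic
    memory) is illustrative and is not taken from the paper.
    """

    def __init__(self, encode: Callable[[object], Vector]):
        self.encode = encode            # frozen encoder, never trained here
        self.memory: List[Vector] = []  # embeddings seen this episode

    def reset(self) -> None:
        """Clear the memory at the start of each episode."""
        self.memory.clear()

    def __call__(self, observation) -> float:
        z = list(self.encode(observation))
        if not self.memory:
            reward = 1.0  # first state of an episode is maximally novel
        else:
            reward = 1.0 - max(cosine_similarity(z, m) for m in self.memory)
        self.memory.append(z)
        return reward


def toy_encoder(observation):
    """Stand-in for a CLIP image encoder: observations are already vectors."""
    return observation


reward_fn = EmbeddingNoveltyReward(toy_encoder)
r_first = reward_fn([1.0, 0.0])   # nothing in memory yet
r_repeat = reward_fn([1.0, 0.0])  # identical embedding already stored
r_new = reward_fn([0.0, 1.0])     # orthogonal to everything stored
```

In this toy run the repeated observation earns no bonus while the orthogonal one earns the full bonus, which is the behaviour an exploration bonus needs. Because the encoder is never updated, nothing is learned on the target task, matching the "without any fine-tuning" property the abstract describes.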
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
Methods: Contrastive Language-Image Pre-training
