Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning
Juan Rocamonde, Victoriano Montesinos, Elvis Nava, Ethan Perez, David Lindner

TL;DR
This paper demonstrates that large pretrained vision-language models can serve as effective zero-shot reward models for reinforcement learning, enabling complex task learning from minimal natural language prompts without manual reward engineering.
Contribution
Introduces VLM-RMs, a method that uses pretrained VLMs as zero-shot reward models for RL, reducing the need for manually specified reward functions and extensive human feedback.
Findings
VLM-RMs based on CLIP train a MuJoCo humanoid on complex tasks (kneeling, doing the splits, sitting in a lotus position) from a single-sentence prompt per task.
Reward modeling performance improves with VLM size and training data.
Abstract
Reinforcement learning (RL) requires either manually specifying a reward function, which is often infeasible, or learning a reward model from a large amount of human feedback, which is often very expensive. We study a more sample-efficient alternative: using pretrained vision-language models (VLMs) as zero-shot reward models (RMs) to specify tasks via natural language. We propose a natural and general approach to using VLMs as reward models, which we call VLM-RMs. We use VLM-RMs based on CLIP to train a MuJoCo humanoid to learn complex tasks without a manually specified reward function, such as kneeling, doing the splits, and sitting in a lotus position. For each of these tasks, we only provide a single sentence text prompt describing the desired task with minimal prompt engineering. We provide videos of the trained agents at: https://sites.google.com/view/vlm-rm. We can improve…
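The core idea of using a CLIP-style model as a reward model can be sketched as follows: embed the task prompt once, embed each rendered frame, and use the cosine similarity between the two embeddings as the reward. This is a minimal sketch, not the authors' implementation. The names make_vlm_reward, toy_encode_image, and toy_encode_text are hypothetical, the toy encoders stand in for a real VLM's image and text encoders, and the prompt string is only illustrative.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors get zero reward
    return dot / (norm_a * norm_b)


def make_vlm_reward(
    encode_image: Callable[[Vector], Vector],
    encode_text: Callable[[str], Vector],
    prompt: str,
) -> Callable[[Vector], float]:
    """Build a reward function from (hypothetical) VLM encoders and a task prompt."""
    text_embedding = encode_text(prompt)  # the prompt is fixed, so embed it once

    def reward(frame: Vector) -> float:
        return cosine_similarity(encode_image(frame), text_embedding)

    return reward


# Toy stand-ins for real encoders (assumptions for illustration only).
def toy_encode_text(prompt: str) -> Vector:
    return [1.0, 0.0]


def toy_encode_image(frame: Vector) -> Vector:
    return list(frame)


reward_fn = make_vlm_reward(toy_encode_image, toy_encode_text, "a humanoid robot kneeling")
aligned = reward_fn([2.0, 0.0])      # frame embedding parallel to the prompt embedding
orthogonal = reward_fn([0.0, 3.0])   # frame embedding orthogonal to the prompt embedding
```

In an RL loop, reward_fn would be called on each rendered observation in place of a hand-written reward function. Swapping the toy encoders for a real pretrained model changes only the two encoder arguments.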
Taxonomy
Topics: Reinforcement Learning in Robotics
Methods: Contrastive Language-Image Pre-training
