Vision-Language Models Provide Promptable Representations for Reinforcement Learning

William Chen; Oier Mees; Aviral Kumar; Sergey Levine

arXiv:2402.02651·cs.LG·May 24, 2024·6 cites

Open Access

TL;DR

This paper introduces a method to leverage vision-language models as promptable, semantic representations for reinforcement learning, enabling agents to utilize background knowledge and reasoning for improved performance in complex tasks.

Contribution

The paper presents a novel approach that uses pre-trained vision-language models as promptable embeddings to enhance reinforcement learning policies in complex environments.

Findings

01. Promptable VLM embeddings outperform non-promptable image embeddings.

02. The approach surpasses instruction-following methods in RL tasks.

03. Chain-of-thought prompting improves performance in novel scenes.

Abstract

Humans can quickly learn new behaviors by leveraging background world knowledge. In contrast, agents trained with reinforcement learning (RL) typically learn behaviors from scratch. We thus propose a novel approach that uses the vast amounts of general and indexable world knowledge encoded in vision-language models (VLMs) pre-trained on Internet-scale data for embodied RL. We initialize policies with VLMs by using them as promptable representations: embeddings that encode semantic features of visual observations based on the VLM's internal knowledge and reasoning capabilities, as elicited through prompts that provide task context and auxiliary information. We evaluate our approach on visually complex, long-horizon RL tasks in Minecraft and robot navigation in Habitat. We find that our policies trained on embeddings from off-the-shelf, general-purpose VLMs outperform equivalent policies…
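
To make the idea of a promptable representation concrete, here is a minimal sketch, not the authors' implementation. It assumes a frozen encoder that takes an observation plus a text prompt and returns a fixed-size embedding, which a small trainable policy head then consumes. The function `toy_vlm_embed`, the class `LinearPolicy`, the prompts, and all sizes are hypothetical stand-ins; in particular, the encoder here is a prompt-seeded random projection that only mimics the interface of a real VLM.

```python
import hashlib
import math
import random

EMBED_DIM = 16   # assumed embedding size for the toy encoder
NUM_ACTIONS = 4  # assumed number of discrete actions


def toy_vlm_embed(observation, prompt):
    """Hypothetical stand-in for a frozen VLM.

    A real system would query a pre-trained VLM with the image and prompt.
    This stub lets the prompt select a deterministic random projection of
    the (flattened) observation, so the same image yields different
    embeddings under different prompts.
    """
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    scale = 1.0 / math.sqrt(len(observation))
    return [
        math.tanh(sum(rng.gauss(0.0, 1.0) * scale * x for x in observation))
        for _ in range(EMBED_DIM)
    ]


class LinearPolicy:
    """Small policy head; only this part would be trained with RL."""

    def __init__(self, embed_dim, num_actions, seed=0):
        rng = random.Random(seed)
        self.weights = [
            [rng.gauss(0.0, 0.1) for _ in range(embed_dim)]
            for _ in range(num_actions)
        ]
        self.bias = [0.0] * num_actions

    def action_probs(self, embedding):
        logits = [
            sum(w * e for w, e in zip(row, embedding)) + b
            for row, b in zip(self.weights, self.bias)
        ]
        peak = max(logits)  # subtract the max for a numerically stable softmax
        exps = [math.exp(l - peak) for l in logits]
        total = sum(exps)
        return [v / total for v in exps]


# A flattened toy "image" (assumption: 8x8 RGB values in [0, 1)).
obs_rng = random.Random(1)
observation = [obs_rng.random() for _ in range(8 * 8 * 3)]

task_prompt = "Is there a spider in this image?"  # hypothetical task-specific prompt
generic_prompt = "Describe this image."           # hypothetical generic prompt

emb_task = toy_vlm_embed(observation, task_prompt)
emb_generic = toy_vlm_embed(observation, generic_prompt)

policy = LinearPolicy(EMBED_DIM, NUM_ACTIONS)
probs = policy.action_probs(emb_task)
```

The design point the sketch illustrates is the split of responsibilities: the encoder stays fixed and the prompt steers which features of the observation end up in the embedding, while the RL algorithm only updates the lightweight policy head.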

Taxonomy

Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics