Reinforcement Learning Friendly Vision-Language Model for Minecraft

Haobin Jiang, Junpeng Yue, Hao Luo, Ziluo Ding, and Zongqing Lu

arXiv:2303.10571 · cs.LG · August 6, 2024 · 1 citation

TL;DR

This paper introduces CLIP4MC, an RL-friendly vision-language model for Minecraft that incorporates the degree of task completion into training so that it yields better intrinsic rewards for open-ended tasks. The model is trained on YouTube datasets and improves RL performance in Minecraft.

Contribution

The paper proposes a new cross-modal contrastive learning framework, CLIP4MC, that integrates task completion signals into VLM training to enhance RL-friendliness for open-ended tasks.
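As a rough illustration of what cross-modal contrastive learning means here, the sketch below computes a standard symmetric contrastive (InfoNCE-style) loss over a batch of paired video and text embeddings. It is a generic CLIP-style objective and not the CLIP4MC objective: the task completion signal is deliberately left out because this page does not specify how it enters the loss. The function names, the temperature value of 0.07, and the use of plain Python lists as embeddings are all assumptions made for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Generic CLIP-style loss; video_embs[i] is paired with text_embs[i].

    Assumption: temperature=0.07 is an illustrative default,
    not a value taken from the paper.
    """
    v = [normalize(x) for x in video_embs]
    t = [normalize(x) for x in text_embs]
    # Pairwise cosine similarities, sharpened by the temperature.
    logits = [[dot(vi, tj) / temperature for tj in t] for vi in v]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the video-to-text and text-to-video directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))
```

The loss is small when each video embedding is closest to its own text embedding and large when the pairs are mismatched, which is the property a contrastive framework of this kind trains for.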

Findings

01 Achieves better RL task performance than baselines.

02 Provides high-quality YouTube datasets for training.

03 Demonstrates effectiveness in Minecraft environment.

Abstract

One of the essential missions in the AI research community is to build an autonomous embodied agent that can achieve high-level performance across a wide spectrum of tasks. However, acquiring or manually designing rewards for all open-ended tasks is unrealistic. In this paper, we propose a novel cross-modal contrastive learning framework, CLIP4MC, aiming to learn a reinforcement learning (RL) friendly vision-language model (VLM) that serves as an intrinsic reward function for open-ended tasks. Simply utilizing the similarity between the video snippet and the language prompt is not RL-friendly since standard VLMs may only capture the similarity at a coarse level. To achieve RL-friendliness, we incorporate the task completion degree into the VLM training objective, as this information can help agents distinguish the relative importance of different states. Moreover, we…
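To make the idea of a VLM acting as an intrinsic reward concrete, here is a minimal sketch of the simple similarity-based reward that the abstract describes as a starting point: the reward for a video snippet is its cosine similarity to the embedded language prompt. The embeddings are placeholder vectors; in practice they would come from trained video and text encoders. The `baseline` parameter and the clipping at zero are assumptions added for illustration, and the task completion component is not shown because the text above does not describe its form.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def intrinsic_reward(video_emb, prompt_emb, baseline=0.0):
    """Similarity above a baseline, clipped at zero.

    Assumption: the baseline and the clipping are illustrative choices,
    not details taken from the paper.
    """
    return max(cosine_similarity(video_emb, prompt_emb) - baseline, 0.0)
```

An RL agent would receive this value at each step in place of a hand-designed reward. Because such a score only reflects coarse similarity, it may not separate states that are near task completion from states that merely look related, which is the limitation the abstract sets out to address.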

Code & Models

Repositories

PKU-RL/CLIP4MC (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition

Methods: Contrastive Learning