Reinforcement Learning Friendly Vision-Language Model for Minecraft
Haobin Jiang, Junpeng Yue, Hao Luo, Ziluo Ding, and Zongqing Lu

TL;DR
This paper introduces CLIP4MC, an RL-friendly vision-language model for Minecraft that incorporates the task completion degree into training so that it yields better intrinsic rewards for open-ended tasks. It is trained on a dataset built from YouTube videos and improves RL performance over baselines in Minecraft.
Contribution
The paper proposes a new cross-modal contrastive learning framework, CLIP4MC, that integrates task completion signals into VLM training to enhance RL-friendliness for open-ended tasks.
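For readers unfamiliar with cross-modal contrastive learning, the following is a minimal sketch of the generic CLIP-style symmetric contrastive loss between video and text embeddings. It is not the CLIP4MC objective: this summary does not specify how the task completion signal enters the loss, so that part is left out. The function names, the temperature value of 0.07, and the toy embeddings are all assumptions made for illustration.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Generic CLIP-style loss: pair i of videos and texts is the positive pair."""
    videos = [_normalize(v) for v in video_embs]
    texts = [_normalize(t) for t in text_embs]
    # Cosine similarity of every video with every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(v, t)) / temperature for t in texts]
        for v in videos
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    video_to_text = _cross_entropy_diagonal(logits)
    text_to_video = _cross_entropy_diagonal(logits_transposed)
    return 0.5 * (video_to_text + text_to_video)


# Hypothetical 2-D embeddings: aligned pairs versus swapped pairs.
loss_matched = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                          [[1.0, 0.0], [0.0, 1.0]])
loss_swapped = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                          [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the loss is close to zero when each video embedding lines up with its own text embedding, and large when the pairs are swapped, which is the behavior a contrastive objective is meant to produce.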
Findings
Achieves better RL task performance than baselines.
Provides high-quality YouTube datasets for training.
Demonstrates effectiveness in the Minecraft environment.
Abstract
One of the essential missions in the AI research community is to build an autonomous embodied agent that can achieve high-level performance across a wide spectrum of tasks. However, acquiring or manually designing rewards for all open-ended tasks is unrealistic. In this paper, we propose a novel cross-modal contrastive learning framework, CLIP4MC, aiming to learn a reinforcement learning (RL) friendly vision-language model (VLM) that serves as an intrinsic reward function for open-ended tasks. Simply utilizing the similarity between the video snippet and the language prompt is not RL-friendly since standard VLMs may only capture the similarity at a coarse level. To achieve RL-friendliness, we incorporate the task completion degree into the VLM training objective, as this information can assist agents in distinguishing the importance between different states. Moreover, we…
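The baseline idea the abstract refers to, using video-text similarity as an intrinsic reward, can be sketched as follows. This is a simplified illustration under stated assumptions: the embeddings are hypothetical hand-written vectors standing in for the outputs of a video encoder and a text encoder, and `intrinsic_reward` is a name chosen here, not an API from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as carrying no similarity signal
    return dot / (norm_u * norm_v)


def intrinsic_reward(video_embedding, text_embedding):
    """Hypothetical reward: similarity of a recent video snippet to the task prompt."""
    return cosine_similarity(video_embedding, text_embedding)


# Hypothetical embeddings standing in for encoder outputs.
prompt_embedding = [1.0, 0.0, 0.0]      # e.g. the encoded task prompt
near_goal_snippet = [0.9, 0.1, 0.0]     # snippet that mostly matches the prompt
unrelated_snippet = [0.0, 1.0, 0.0]     # snippet unrelated to the prompt

reward_near = intrinsic_reward(near_goal_snippet, prompt_embedding)
reward_unrelated = intrinsic_reward(unrelated_snippet, prompt_embedding)
```

A reward of this form ranks the matching snippet above the unrelated one, but it only reflects how well the embeddings align. The abstract's point is that such coarse similarity is not enough on its own, which is why the task completion degree is added to the training objective.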
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition
Methods: Contrastive Learning
