VLP: Vision-Language Preference Learning for Embodied Manipulation
Runze Liu, Chenjia Bai, Jiafei Lyu, Shengjie Sun, Yali Du, Xiu Li

TL;DR
This paper introduces VLP, a vision-language preference learning framework that learns a preference model from implicit preference data and uses it to provide preference feedback for embodied manipulation tasks in reinforcement learning, reducing reliance on costly human annotations.
Contribution
The paper presents a novel vision-language preference model that learns from implicit data, enabling more efficient preference feedback in embodied manipulation tasks.
Findings
Outperforms baseline methods in simulated tasks
Generalizes well to unseen tasks and instructions
Provides accurate preference annotations
Abstract
Reward engineering is one of the key challenges in Reinforcement Learning (RL). Preference-based RL effectively addresses this issue by learning from human feedback. However, it is both time-consuming and expensive to collect human preference labels. In this paper, we propose a novel Vision-Language Preference learning framework, named VLP, which learns a vision-language preference model to provide preference feedback for embodied manipulation tasks. To achieve this, we define three types of language-conditioned preferences and construct a vision-language preference dataset, which contains versatile implicit preference orders without human annotations. The preference model learns to extract language-related features, and then serves as a preference annotator in various downstream tasks. The policy can be learned according to the annotated preferences…
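The abstract describes a preference model that acts as an annotator for downstream policy learning. As a rough illustration of how pairwise preference labels are commonly modeled in preference-based RL, the sketch below uses the standard Bradley-Terry formulation over scalar segment scores. It is not the paper's architecture: the scalar scores, the function names, and the `margin` parameter are all hypothetical and introduced only for illustration.

```python
import math


def preference_prob(score_a: float, score_b: float) -> float:
    """Bradley-Terry probability that segment A is preferred over segment B.

    The scores are assumed to be scalar outputs of some preference model
    (hypothetical here); only their difference matters.
    """
    return 1.0 / (1.0 + math.exp(score_b - score_a))


def preference_loss(score_a: float, score_b: float, label: float) -> float:
    """Cross-entropy between the predicted preference and a label.

    label = 1.0 means A is preferred, 0.0 means B is preferred,
    and 0.5 means the two segments are equally preferred.
    """
    eps = 1e-12  # guards against log(0)
    p = preference_prob(score_a, score_b)
    return -(label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps))


def annotate(score_a: float, score_b: float, margin: float = 0.0) -> float:
    """Turn model scores into a preference label for downstream training.

    `margin` is an illustrative assumption: pairs whose predicted
    probability lies within `margin` of 0.5 are labeled as ties.
    """
    p = preference_prob(score_a, score_b)
    if abs(p - 0.5) <= margin:
        return 0.5
    return 1.0 if p > 0.5 else 0.0
```

Under these assumptions, `annotate` plays the role of the annotator and `preference_loss` is the kind of objective a reward or preference model could be fit with; the paper's actual model and training objective may differ.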
Peer Reviews
Decision·Submitted to ICLR 2025
1. This work proposes three forms of language-conditioned preferences: ITP, ILP and IVP.
2. This work proposes a vision-language preference learning framework with a theoretical analysis of its behavior.
3. Experiments are well organized to answer four key questions.
4. Experiments show that the proposed VLP leads to better performance than other state-of-the-art baselines.
1. In the experiments, only a single benchmark, Meta-World, is used.
   - This is insufficient to show the generality of the proposed preference learning framework.
   - Related works in section 4 have tested on several different environments.
2. Only five of the 50 tasks in Meta-World are evaluated.
   - This set of test tasks is not so challenging compared to the 45 training tasks.
3. Why are RL-VLM-F and CriticGPT not compared?
4. The dataset construction itself may not be a notable contribut
* This paper presents strong empirical evidence and extensive experiments supporting the proposed approach.
* The novel cross-modal architecture effectively fuses video and language through learnable parameters to compute preferences.
* Furthermore, the introduction of language-conditioned preferences, namely Intra-Task Preference (ITP), Inter-Language Preference (ILP), and Inter-Video Preference (IVP), is a notable contribution that enhances the model's adaptability across different scenarios
* The theoretical claim seems to lack clear logical reasoning to justify the assertion that "the proposed preference model can be considered as parameterized negative regret that approximates the true negative regret of the whole segment". Although Eq. (10) and Eq. (11) have similar shapes, that does not mean that one approximates the other.
* I'm concerned that the simplicity of ILP and IVP definitions may limit VLP's generalizability. The preference labels defined in Table 1 overlook potential
1. The paper presents an effective framework that combines vision-language alignment with preference learning for robotic manipulation tasks. The experimental results show consistent improvements over VLM-based approaches across multiple tasks and demonstrate good generalization performance.
2. The paper is well-structured and easy to follow, presenting its ideas clearly.
1. The evaluation is limited to relatively simple Meta-World tasks, without testing on more complex task domains (e.g., MANISKILL2 [1] and MyoSuite [2]).
2. The paper lacks comparison with human preference labels, which would validate the quality of the generated preferences against human intent.
3. The theoretical analysis assumes access to all possible segments, weakening its practical implications.
4. (minor) The paper does not report the performance of scripted policies, which would help est
Taxonomy
Topics: Multimodal Machine Learning Applications
