Vision Language Models are In-Context Value Learners

Yecheng Jason Ma; Joey Hejna; Ayzaan Wahid; Chuyuan Fu; Dhruv Shah; Jacky Liang; Zhuo Xu; Sean Kirmani; Peng Xu; Danny Driess; Ted Xiao; Jonathan Tompson; Osbert Bastani; Dinesh Jayaraman; Wenhao Yu; Tingnan Zhang; Dorsa Sadigh; Fei Xia

arXiv:2411.04549·cs.RO·November 8, 2024

TL;DR

This paper introduces GVL, a universal value function estimator leveraging vision-language models to predict task progress across diverse tasks and domains without additional training, by framing value estimation as a temporal ordering problem.

Contribution

GVL is a novel approach that uses vision-language models for zero-shot and few-shot value prediction in robotic tasks through temporal ordering of shuffled frames.

Findings

01

GVL achieves effective value predictions for over 300 real-world tasks.

02

GVL enables zero-shot and few-shot learning without task-specific training.

03

GVL supports multi-modal in-context learning with heterogeneous examples.

Abstract

Predicting temporal progress from visual trajectories is important for intelligent robots that can learn, adapt, and improve. However, learning such a progress estimator, or temporal value function, across different tasks and domains requires both a large amount of diverse data and methods which can scale and generalize. To address these challenges, we present Generative Value Learning (GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress. Naively asking a VLM to predict values for a video sequence performs poorly due to the strong temporal correlation between successive frames. Instead, GVL poses value estimation as a temporal ordering problem over shuffled video frames; this seemingly more challenging task encourages VLMs to more fully exploit their underlying semantic and temporal grounding…
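The shuffling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `predict_progress` is a hypothetical placeholder for a VLM query that returns one progress value per frame, and the function name, seed handling, and value range are assumptions made for illustration. The sketch shows only the bookkeeping: frames are shuffled before the query, and the predictions are mapped back to the original temporal order afterwards.

```python
import random
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")


def shuffled_value_estimates(
    frames: Sequence[Frame],
    predict_progress: Callable[[List[Frame]], List[float]],
    seed: int = 0,
) -> List[float]:
    """Query a predictor on shuffled frames; return values in original order.

    `predict_progress` is a hypothetical stand-in for a VLM call (assumption):
    it receives the shuffled frames and returns one progress value per frame.
    """
    rng = random.Random(seed)
    order = list(range(len(frames)))
    rng.shuffle(order)  # order[pos] = original index of the frame shown at pos

    shuffled = [frames[i] for i in order]
    predictions = predict_progress(shuffled)
    if len(predictions) != len(frames):
        raise ValueError("predictor must return one value per frame")

    # Undo the shuffle so values[t] is the estimate for original frame t.
    values = [0.0] * len(frames)
    for pos, original_index in enumerate(order):
        values[original_index] = predictions[pos]
    return values


if __name__ == "__main__":
    # Toy stand-in: each "frame" is its own timestep, and the stub predictor
    # reads progress directly from it. A real VLM would see images instead.
    toy_frames = list(range(5))
    stub = lambda fs: [f / 4 for f in fs]
    print(shuffled_value_estimates(toy_frames, stub, seed=3))
```

Because the predictor never sees the frames in temporal order, it cannot produce a monotonically increasing answer simply by following their position in the input, which is the shortcut the abstract attributes to the naive approach.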

Peer Reviews

Decision·ICLR 2025 Spotlight

Reviewer 01 · Rating 8 · Confidence 3

Strengths

Although the proposed technique is simple, it seems (based on myriad experiments) to be effective. Figure 2 demonstrates a clear superiority of GVL over LIV, the baseline that the authors chose to test, for text-based goals. Figure 4 demonstrates that GVL can be decently effective at predicting values correctly over more complex videos. Finally, GVL was effective at training agents via both (1) data filtering for imitation learning and (2) advantage estimation for advantage weighted regression.

Weaknesses

The use of Value Order Correlation (VOC)—the correlation between GVL and frame order—seemed somewhat circular. It is used on the OXE dataset to judge GVL’s performance, a choice that assumes that the contents of OXE are of sufficient quality. (If, for instance, OXE contained many sub-optimal trajectories, an ideal value estimator would show low VOC.) Later, there is a discussion about the use of GVL+VOC to identify sub-optimal datapoints/subsets within OXE. This seemed confusing to me since it s
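To make the metric under discussion concrete, here is a minimal sketch of a value-order correlation. The reviews describe it only as a correlation between predicted values and frame order; the choice of a Spearman rank correlation, the function names, and the convention of returning 0.0 for constant predictions are assumptions made for this illustration.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(xs: Sequence[float]) -> List[float]:
    """Return 1-based ranks of xs, giving tied values their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # convention chosen for this sketch (assumption)
    return cov / sqrt(var_a * var_b)


def value_order_correlation(values: Sequence[float]) -> float:
    """Rank correlation between predicted values and frame index 0..T-1."""
    if len(values) < 2:
        raise ValueError("need at least two frames")
    timestep_ranks = [float(t + 1) for t in range(len(values))]
    return pearson(average_ranks(values), timestep_ranks)


if __name__ == "__main__":
    print(value_order_correlation([0.0, 0.2, 0.5, 0.9]))  # steadily rising
    print(value_order_correlation([0.1, 0.5, 0.3, 0.2]))  # stalls, regresses
```

Under this definition, a trajectory whose predicted values rise steadily with time scores near 1, while one whose values stall or regress scores lower. This is why the score can be read either as a measure of estimator quality (if the videos are assumed expert) or as a measure of trajectory quality (if the estimator is assumed accurate), the tension the reviewer points to.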

Reviewer 02 · Rating 6 · Confidence 3

Strengths

Pros: 1. The high-level idea of the paper is easy to understand thanks to the effort the authors put into the figures. 2. The experiments make it easy to see how the proposed method improves on the baseline (or baselines, in some cases).

Weaknesses

Weakness 1. Currently, the block diagram only provides a high-level idea of what the paper is about, which is good. But it would be nice to also include the language component of the model. For example, I get that you are trying to predict the value estimate of a frame, but without the task description, it’d be impossible to predict what the value estimate is. 2. The experiments are very limited, although the use cases are diverse enough. In each of these use cases, however, the comparisons are quite rest

Reviewer 03 · Rating 8 · Confidence 4

Strengths

- Well motivated, and sets out desiderata for a “universal value estimator”: accurately estimating state and consistency, highlighting the limitations of prior works
- Interesting finding: shuffling the video frames helps mitigate the temporal bias found in videos ⇒ better value estimations
- Introduces a new evaluation metric, Value-Order Correlation, which measures how well predicted values correlate with the ground-truth timestep in expert videos.
- Extensive evaluation of GVL’s value estimates acro

Weaknesses

- The real-world results are not too impressive. Applying GVL to compute values only leads to a small improvement over the base diffusion policy baseline. It would be interesting to see more results showing how the value estimations of GVL help with downstream policy learning.
- The method itself is technically simple, but still well-motivated and supported with empirical results.

Taxonomy

Topics: Language, Metaphor, and Cognition