Revisiting the Learning Objectives of Vision-Language Reward Models

Simon Roy; Samuel Barbeau; Giovanni Beltrame; Christian Desrosiers; Nicolas Thome

arXiv:2512.20675·cs.LG·December 25, 2025

TL;DR

This paper evaluates recent VLM-based reward models under a unified framework with identical backbones, finetuning data, and evaluation environments. A simple triplet loss outperforms more complex state-of-the-art objectives, which highlights how much training data and architecture choices matter.

Contribution

The paper isolates the effect of the learning objective in VLM-based reward models and shows that a simpler loss function can be more effective than more complex approaches.

Findings

01. A simple triplet loss outperforms state-of-the-art methods in reward modeling.

02. Differences in data and architectures significantly impact performance.

03. A unified evaluation framework clarifies the true impact of learning objectives.

Abstract

Learning generalizable reward functions is a core challenge in embodied intelligence. Recent work leverages contrastive vision-language models (VLMs) to obtain dense, domain-agnostic rewards without human supervision. These methods adapt VLMs into reward models through increasingly complex learning objectives, yet meaningful comparison remains difficult due to differences in training data, architectures, and evaluation settings. In this work, we isolate the impact of the learning objective by evaluating recent VLM-based reward models under a unified framework with identical backbones, finetuning data, and evaluation environments. Using Meta-World tasks, we assess modeling accuracy by measuring consistency with ground-truth reward and correlation with expert progress. Remarkably, we show that a simple triplet loss outperforms state-of-the-art methods, suggesting that much of the…
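
As a rough illustration of the two ideas named in the abstract, the sketch below computes a triplet loss over embeddings and a rank correlation between rewards and progress along a trajectory. It is not the authors' implementation. Several choices are assumptions made only for illustration: the text embedding serves as the anchor, a later frame as the positive and an earlier frame as the negative; the reward is the cosine similarity between text and frame embeddings; the margin is 0.2; and "correlation with expert progress" is approximated by a Spearman correlation with the time step. All vectors are toy values.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on cosine similarity.

    The loss is zero once the anchor is more similar to the positive
    than to the negative by at least `margin` (an assumed value).
    """
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))


def spearman(xs, ys):
    """Spearman rank correlation; assumes no ties, for illustration only."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Toy 2-D embeddings (hypothetical values, not from the paper).
text = [1.0, 0.0]          # embedding of the task description
late_frame = [0.9, 0.1]    # frame near task completion
early_frame = [0.1, 0.9]   # frame near the start

# Correctly ordered triplet gives zero loss; the swapped one is penalized.
loss_good = triplet_loss(text, late_frame, early_frame)
loss_bad = triplet_loss(text, early_frame, late_frame)

# Reward per frame of a toy expert trajectory, then correlation with time.
frames = [[0.1, 0.9], [0.4, 0.6], [0.7, 0.3], [0.9, 0.1]]
rewards = [cosine(text, f) for f in frames]
progress_corr = spearman(rewards, list(range(len(frames))))
```

In this toy setup the rewards increase along the trajectory, so the rank correlation with the time step is at its maximum. A reward model that ordered frames poorly would score lower.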

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling