Video-Language Critic: Transferable Reward Functions for Language-Conditioned Robotics

Minttu Alakuijala; Reginald McLean; Isaac Woungang; Nariman Farsad; Samuel Kaski; Pekka Marttinen; Kai Yuan

arXiv:2405.19988 · cs.RO · September 18, 2025

TL;DR

This paper introduces Video-Language Critic, a transferable reward model trained on cross-embodiment video data, enabling more sample-efficient language-conditioned robot policy learning across different tasks and domains.

Contribution

The paper presents a novel reward model that leverages contrastive learning and temporal ranking on video-language data, separating task specification from robot embodiment for improved transferability.
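As a rough illustration of the contrastive part of this idea, the sketch below computes a symmetric InfoNCE-style loss between a batch of video embeddings and their paired caption embeddings. It is a generic formulation, not the authors' implementation: the function name `symmetric_contrastive_loss`, the temperature value, and the use of cosine similarity are all assumptions made for the example.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Hypothetical helper: row i of video_emb is paired with row i of text_emb.

    The temperature and cosine similarity are assumptions, not values
    taken from the paper.
    """
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(v))                 # matching pairs sit on the diagonal
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # video -> text
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> video
    return 0.5 * (loss_v2t + loss_t2v)

# Toy check: correctly paired embeddings should score a lower loss
# than the same embeddings with the captions shifted by one position.
emb = np.eye(4)
print(symmetric_contrastive_loss(emb, emb) <
      symmetric_contrastive_loss(emb, np.roll(emb, 1, axis=0)))
```

Because the loss only compares videos with captions, nothing in it refers to robot actions, which is one way to read the claim that task specification is separated from embodiment.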

Findings

1. Enables 2x more sample-efficient policy training on Meta-World tasks.
2. Outperforms prior reward models in generalization and efficiency.
3. Effective across different robot embodiments and task domains.

Abstract

Natural language is often the easiest and most convenient modality for humans to specify tasks for robots. However, learning to ground language to behavior typically requires impractical amounts of diverse, language-annotated demonstrations collected on each target robot. In this work, we aim to separate the problem of what to accomplish from how to accomplish it, as the former can benefit from substantial amounts of external observation-only data, and only the latter depends on a specific robot embodiment. To this end, we propose Video-Language Critic, a reward model that can be trained on readily available cross-embodiment data using contrastive learning and a temporal ranking objective, and use it to score behavior traces from a separate actor. When trained on Open X-Embodiment data, our reward model enables 2x more sample-efficient policy training on Meta-World tasks than a sparse…
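The abstract names a temporal ranking objective but this page does not spell out its form. One common way to express such an objective, shown here purely as an assumption, is a hinge penalty that fires whenever the critic's score drops between consecutive points of a successful video. The function name `temporal_ranking_loss` and the `margin` parameter are hypothetical.

```python
import numpy as np

def temporal_ranking_loss(prefix_scores, margin=0.0):
    """Hypothetical hinge-style ranking penalty.

    prefix_scores[k] is the critic's score for the first k+1 frames of a
    successful video. Assumption: scores should not decrease as the task
    progresses, so each drop (plus margin) is penalized.
    """
    s = np.asarray(prefix_scores, dtype=float)
    drops = s[:-1] - s[1:] + margin         # positive where the score fell
    return float(np.maximum(drops, 0.0).sum())

# A steadily rising score incurs no penalty; a dip is penalized by its size.
print(temporal_ranking_loss([0.1, 0.4, 0.9]))
print(temporal_ranking_loss([0.0, 1.0, 0.5, 2.0]))
```

A score that rises with task progress is what would let such a critic act as a denser learning signal than a success-only reward when scoring an actor's behavior traces.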

Code & Models

Repositories

minttusofia/video_language_critic (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Contrastive Learning