Adapting Vision-Language Models for Evaluating World Models

Mariya Hendriksen; Tabish Rashid; David Bignell; Raluca Georgescu; Abdelhak Lemkhenter; Katja Hofmann; Sam Devlin; Sarah Parisot

arXiv:2506.17967·cs.LG·November 26, 2025

3 Reviews

TL;DR

This paper introduces UNIVERSE, a vision-language model-based evaluation protocol for world models, capable of fine-grained, temporally sensitive assessment of environment simulations, aligning well with human judgments across diverse settings.

Contribution

The paper presents UNIVERSE, a unified, adaptable VLM-based evaluator for video world model rollouts, addressing evaluation challenges with extensive experiments and human validation.

Findings

1. UNIVERSE achieves parity with task-specific evaluators.
2. Strong alignment with human judgments across environments.
3. Effective adaptation methods under data and compute constraints.

Abstract

World models - generative models that simulate environment dynamics conditioned on past observations and actions - are gaining prominence in planning, simulation, and embodied AI. However, evaluating their rollouts remains a fundamental challenge, requiring fine-grained, temporally grounded assessment of action alignment and semantic consistency - capabilities not captured by existing metrics. Vision-Language Models (VLMs) have shown promise as automatic evaluators of generative content due to their strong multimodal reasoning abilities. Yet, their use in fine-grained, temporally sensitive evaluation tasks remains limited and requires targeted adaptation. We introduce an evaluation protocol targeting two recognition tasks - action recognition and character recognition - each assessed across binary, multiple-choice, and open-ended formats. To support this, we present UNIVERSE (UNIfied…
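The protocol in the abstract crosses two recognition tasks with three answer formats, giving six evaluation cells. A minimal sketch of tallying per-cell accuracy follows. The record fields, the function names, and the exact-match scoring rule are all assumptions made for illustration; the paper may score open-ended answers differently.

```python
from collections import defaultdict
from itertools import product

# Task and format names follow the abstract; the identifiers are hypothetical.
TASKS = ("action_recognition", "character_recognition")
FORMATS = ("binary", "multiple_choice", "open_ended")
CELLS = frozenset(product(TASKS, FORMATS))  # the six task/format combinations


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different answers match."""
    return " ".join(answer.lower().split())


def accuracy_by_cell(records):
    """Return {(task, format): accuracy} for every cell that has records.

    Each record is a dict with keys: task, format, prediction, reference.
    Exact match after normalization is an assumed scoring rule.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        cell = (record["task"], record["format"])
        if cell not in CELLS:
            raise ValueError(f"unknown task/format combination: {cell}")
        totals[cell] += 1
        if normalize(record["prediction"]) == normalize(record["reference"]):
            hits[cell] += 1
    return {cell: hits[cell] / totals[cell] for cell in totals}


if __name__ == "__main__":
    demo = [
        {"task": "action_recognition", "format": "binary",
         "prediction": "Yes", "reference": "yes"},
        {"task": "action_recognition", "format": "binary",
         "prediction": "no", "reference": "yes"},
    ]
    for cell, acc in accuracy_by_cell(demo).items():
        print(cell, acc)
```

Keeping the six cells separate, rather than pooling them into one score, lets a reader see whether an evaluator is weak on a particular task or answer format.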

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 4·Confidence 4

Strengths

1. **Well-motivated problem with practical importance:** The paper addresses a genuine need in the world model community for fine-grained, semantically-aware, and temporally-grounded evaluation methods. The limitations of low-level metrics like FID/FVD are well-articulated.
2. **Structured and actionable evaluation protocol:** The AR/CR task decomposition with multiple QA formats provides a clear, operationalizable framework that other researchers can adopt and extend. This structured approach

Weaknesses

1. **Missing comparison with existing evaluation benchmarks:** My primary concern is that the paper motivates UNIVERSE by suggesting existing benchmarks lack certain capabilities, but I could not find any direct experimental comparison with methods like VBench or EvalCrafter on the same data. I would strongly encourage the authors to provide such comparisons, specifically measuring how different evaluation protocols correlate with human judgments on identical rollouts. This would substantiate th
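The comparison this reviewer asks for, how well each evaluation protocol tracks human judgments on identical rollouts, is commonly measured with a rank correlation. A minimal sketch using Spearman's coefficient follows. The function names and the choice of Spearman are assumptions for illustration; neither the review excerpt nor the paper summary specifies a statistic.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(metric_scores, human_scores):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(metric_scores) != len(human_scores) or len(metric_scores) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))
```

Calling `spearman` once per protocol, with one score per rollout and the same human ratings each time, would yield directly comparable numbers.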

Reviewer 02·Rating 8·Confidence 3

Strengths

Overall, this problem is very timely given recent interest in world models. While this work focuses on a single game, the authors extensively ablate the different components of their method and show that their approach is effective and likely generalizable to other environments. In particular, they thoroughly ablate the training data composition and VLM training method. Additionally, the authors provide a thorough evaluation of the model's generalization to out-of-distribution data by evaluating th

Weaknesses

The main weakness, as mentioned by the authors in the limitations section, is whether this method can be applied beyond video games to other environments. The training data construction method relies on action logs from the game which are not available for other environments and might be costly to acquire for a large set of environments. The problem is likely compounded for real-world simulators used for robotics and other embodied agents.

Reviewer 03·Rating 2·Confidence 4

Strengths

- The idea of using VLMs for semantic tasks evaluation as a proxy for generation quality evaluation is new and interesting.
- The method presents multiple options for training / fine-tuning different subsets of parameters. Notably, only updating the projection head (0.07% of params according to the paper) proves highly effective. This piece of evidence could be important for the community I believe.
- The paper is overall well-written, easy to follow, and thorough. The appendix is detailed and i

Weaknesses

`W1`: The paper only presents VLM baselines for world model evaluation, which, according to the authors, were never used before for that purpose. How the method compares to simple world model evaluation approaches such as measuring mean squared error or FVD (between generations and ground truth using a prefix as initial context) using a held out test set, or training control policies under various reward signals (goals) through world model interaction and measuring success via cumulative returns
