WorldJen: An End-to-End Multi-Dimensional Benchmark for Generative Video Models

Karthik Inbasekar; Guy Rom; Omer Shlomovits

arXiv:2605.03475·cs.CV·May 7, 2026

TL;DR

WorldJen introduces a comprehensive multi-dimensional benchmark for generative video models, combining human preference studies and a VLM-based evaluation to address limitations of existing metrics.

Contribution

WorldJen presents a novel evaluation framework that combines human judgments with grading by a vision-language model (VLM) to assess multiple quality dimensions simultaneously.

Findings

01. VLM-based evaluation reproduces the human ranking with perfect tier agreement.

02. The benchmark evaluates 6 state-of-the-art video models across 16 quality dimensions.

03. A large-scale human preference study establishes a reliable ground-truth rating system.
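
To make the first finding concrete, here is a minimal sketch of how agreement between a human ranking and a VLM ranking could be checked at the tier level. This summary does not say how the paper defines tiers, so the sketch assumes tiers are formed by sorting models on their mean Likert score and splitting the order into equal-sized groups. The model names, the scores, and the helpers `rank_models`, `assign_tiers`, and `tier_agreement` are hypothetical and are not taken from the paper.

```python
from statistics import mean

def rank_models(scores):
    """Return model names sorted from highest to lowest mean Likert score."""
    return sorted(scores, key=lambda m: mean(scores[m]), reverse=True)

def assign_tiers(ranking, tier_size):
    """Map each model to a tier index (0 = best) by chunking the ranking."""
    return {model: i // tier_size for i, model in enumerate(ranking)}

def tier_agreement(human, vlm, tier_size=2):
    """Fraction of models that both raters place in the same tier."""
    h = assign_tiers(rank_models(human), tier_size)
    v = assign_tiers(rank_models(vlm), tier_size)
    return sum(h[m] == v[m] for m in h) / len(h)

# Made-up 1-5 Likert ratings for hypothetical models (not results from the paper).
human_scores = {
    "model_a": [5, 4, 5], "model_b": [4, 4, 5],
    "model_c": [3, 3, 4], "model_d": [3, 3, 3],
    "model_e": [2, 1, 2], "model_f": [1, 1, 2],
}
vlm_scores = {
    "model_a": [4, 4, 5], "model_b": [5, 4, 5],
    "model_c": [3, 4, 3], "model_d": [3, 3, 3],
    "model_e": [1, 2, 2], "model_f": [2, 2, 2],
}

agreement = tier_agreement(human_scores, vlm_scores)
print(agreement)
```

In this toy data the two raters disagree on the exact order inside some tiers, yet every model lands in the same tier, which is why tier agreement is a weaker (and more forgiving) criterion than an identical ranking.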

Abstract

Evaluating generative video models remains an open problem. Reference-based metrics such as Structural Similarity Index Measure (SSIM) and Peak Signal-to-Noise Ratio (PSNR) reward pixel fidelity over semantic correctness, while Fréchet Video Distance (FVD) favors distributional textures over physical plausibility. Binary Visual Question Answering (VQA)-based benchmarks such as VBench 2.0 are prone to yes-bias and rely on low-resolution auditors that miss temporal failures. Moreover, their prompts target a single dimension at a time, multiplying the number of videos required while still not guaranteeing reliable results. WorldJen addresses these limitations directly. Binary VQA is replaced with Likert-scale questionnaires graded by a VLM that receives frames at native video resolution. Video generation costs are addressed by using adversarially curated prompts that are designed to…
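
The abstract's remark that PSNR rewards pixel fidelity over semantic correctness can be illustrated with a small sketch. It uses the standard definition, PSNR = 10 · log10(MAX² / MSE), on a toy one-dimensional "frame"; the data and variable names are illustrative assumptions and are not drawn from the paper.

```python
import math

def psnr(reference, candidate, max_value=255.0):
    """PSNR in dB between two equal-length sequences of pixel intensities."""
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)

# Toy 1-D "frame": alternating black and white stripes (illustrative data).
stripes = [0, 255] * 4
shifted = stripes[1:] + stripes[:1]   # same stripes, moved by one pixel
flat_gray = [128] * len(stripes)      # stripes removed entirely

psnr_shifted = psnr(stripes, shifted)
psnr_gray = psnr(stripes, flat_gray)
print(f"shifted: {psnr_shifted:.2f} dB, flat gray: {psnr_gray:.2f} dB")
```

The shifted frame still shows the same stripes, but it scores lower than the flat gray frame that contains no stripes at all, because PSNR only compares intensities pixel by pixel.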

Code & Models

Repositories

https://moonmath.ai/worldjen
github

Datasets

ik6626/WorldJen-benchmarking-subsystem
dataset· 38 dl