WorldJen: An End-to-End Multi-Dimensional Benchmark for Generative Video Models
Karthik Inbasekar, Guy Rom, Omer Shlomovits

TL;DR
WorldJen introduces a multi-dimensional benchmark for generative video models that combines a human preference study with evaluation by a vision-language model (VLM) to address the limitations of existing metrics.
Contribution
It presents a novel evaluation framework that integrates human judgments and a vision-language model to assess multiple quality dimensions simultaneously.
Findings
VLM-based evaluation reproduces human ranking with perfect tier agreement.
The benchmark evaluates 6 state-of-the-art video models across 16 quality dimensions.
A large-scale human preference study establishes a reliable ground-truth rating system.
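The summary does not say how agreement between the VLM and human rankings is measured, so the following is only a minimal sketch of one common way to compare two rankings: Spearman rank correlation. The statistic is a stand-in chosen for illustration, not the paper's tier-agreement measure, and the model names and scores are hypothetical placeholders rather than results from the benchmark.

```python
def ranks(scores):
    """Map each name to its rank (1 = highest score). Assumes no tied scores."""
    order = sorted(scores, key=scores.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(order)}


def spearman(scores_x, scores_y):
    """Spearman rank correlation for two score dicts with identical keys and no ties."""
    rx, ry = ranks(scores_x), ranks(scores_y)
    n = len(rx)
    d2 = sum((rx[k] - ry[k]) ** 2 for k in rx)
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Hypothetical mean scores per model (NOT data from the paper).
human = {"model_a": 4.2, "model_b": 3.9, "model_c": 3.1, "model_d": 2.8}
vlm = {"model_a": 4.5, "model_b": 3.6, "model_c": 3.3, "model_d": 2.5}
# Same VLM scores with the two lowest models swapped.
vlm_swapped = {"model_a": 4.5, "model_b": 3.6, "model_c": 2.5, "model_d": 3.3}

rho = spearman(human, vlm)                  # identical ordering
rho_swapped = spearman(human, vlm_swapped)  # one adjacent swap
```

In this toy setup the first comparison yields a correlation of 1 because both raters order the models identically, while a single adjacent swap lowers it. A tier-based comparison would be coarser, since it only asks whether each model lands in the same group.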
Abstract
Evaluating generative video models remains an open problem. Reference-based metrics such as Structural Similarity Index Measure (SSIM) and Peak Signal-to-Noise Ratio (PSNR) reward pixel fidelity over semantic correctness, while Fréchet Video Distance (FVD) favors distributional textures over physical plausibility. Binary Visual Question Answering (VQA) based benchmarks like VBench 2.0 are prone to yes-bias and rely on low-resolution auditors that miss temporal failures. Moreover, their prompts target a single dimension at a time, multiplying the number of videos required while still not guaranteeing reliable results. WorldJen addresses these limitations directly. Binary VQA is replaced with Likert-scale questionnaires graded by a VLM that receives frames at native video resolution. Video generation costs are addressed by using adversarially curated prompts that are designed to…
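To make the point about pixel fidelity concrete, here is a small sketch of PSNR using its standard definition, 10 · log10(MAX² / MSE). The toy 4x4 grayscale frames are invented for illustration and are not from the paper: one keeps the striped pattern but shifts it by a pixel, the other discards the pattern entirely.

```python
import math


def mse(a, b):
    """Mean squared error between two equally sized grayscale frames (lists of rows)."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += (pa - pb) ** 2
            count += 1
    return total / count


def psnr(a, b, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB; infinite when the frames are identical."""
    err = mse(a, b)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)


# Toy reference frame: vertical black/white stripes.
reference = [[0, 255, 0, 255] for _ in range(4)]
# Same stripes shifted by one pixel: the content is preserved, but every pixel differs.
shifted = [[255, 0, 255, 0] for _ in range(4)]
# Flat gray frame: the stripes are gone entirely.
flat_gray = [[128, 128, 128, 128] for _ in range(4)]

psnr_shifted = psnr(reference, shifted)
psnr_gray = psnr(reference, flat_gray)
```

Under these assumptions the flat gray frame scores a higher PSNR than the shifted stripes, even though only the shifted frame still depicts the original pattern. This is the kind of mismatch between pixel error and semantic content that the abstract refers to.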
