T2VWorldBench: A Benchmark for Evaluating World Knowledge in Text-to-Video Generation

Yubin Chen; Xuyang Guo; Zhenmei Shi; Zhao Song; Jiahao Zhang

arXiv:2507.18107 · cs.CV · July 25, 2025

TL;DR

T2VWorldBench is a benchmark for evaluating how well text-to-video models incorporate world knowledge. It reveals significant gaps in current models' ability to maintain factual accuracy and semantic consistency.

Contribution

This paper introduces the first systematic evaluation framework for assessing world knowledge in text-to-video models, covering diverse domains and combining human and automated assessments.

Findings

01. Most models struggle with understanding world knowledge

02. Current models often generate semantically inconsistent videos

03. Benchmark reveals critical gaps in factual accuracy in T2V models

Abstract

Text-to-video (T2V) models have shown remarkable performance in generating visually reasonable scenes, while their capability to leverage world knowledge for ensuring semantic consistency and factual accuracy remains largely understudied. In response to this challenge, we propose T2VWorldBench, the first systematic evaluation framework for evaluating the world knowledge generation abilities of text-to-video models, covering 6 major categories, 60 subcategories, and 1,200 prompts across a wide range of domains, including physics, nature, activity, culture, causality, and object. To address both human preference and scalable evaluation, our benchmark incorporates both human evaluation and automated evaluation using vision-language models (VLMs). We evaluated the 10 most advanced text-to-video models currently available, ranging from open source to commercial models, and found that most…
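The scale described in the abstract (6 categories, 60 subcategories, 1,200 prompts) can be pictured with a small sketch. The category names and the three totals come from the abstract. Everything else is an assumption made for illustration: the abstract does not say that subcategories or prompts are split evenly, and the `PromptSlot` type, the integer indices, and `build_uniform_layout` are hypothetical names, not part of the benchmark.

```python
from dataclasses import dataclass

# Category names and totals as stated in the abstract.
CATEGORIES = ["physics", "nature", "activity", "culture", "causality", "object"]
NUM_SUBCATEGORIES = 60
NUM_PROMPTS = 1200


@dataclass(frozen=True)
class PromptSlot:
    """Hypothetical placeholder for one prompt position in the benchmark."""
    category: str
    subcategory_index: int  # stand-in: real subcategory names are not given here
    prompt_index: int       # stand-in: real prompt texts are not given here


def build_uniform_layout():
    """Enumerate prompt slots under an ASSUMED even split.

    Assumption (not confirmed by the source): every category has the same
    number of subcategories, and every subcategory the same number of prompts.
    """
    subs_per_category = NUM_SUBCATEGORIES // len(CATEGORIES)
    prompts_per_sub = NUM_PROMPTS // NUM_SUBCATEGORIES
    return [
        PromptSlot(category, sub, prompt)
        for category in CATEGORIES
        for sub in range(subs_per_category)
        for prompt in range(prompts_per_sub)
    ]


layout = build_uniform_layout()
subcategories = {(slot.category, slot.subcategory_index) for slot in layout}
print(len(layout), len(subcategories))
```

Under the even-split assumption, the stated totals work out to 10 subcategories per category and 20 prompts per subcategory. The actual distribution in T2VWorldBench may differ.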

Taxonomy

Topics: Video Analysis and Summarization · Natural Language Processing Techniques · Human Motion and Animation