T2VTextBench: A Human Evaluation Benchmark for Textual Control in Video Generation Models

Xuyang Guo; Jiayan Huo; Zhenmei Shi; Zhao Song; Jiahao Zhang; Jiale Zhao

arXiv:2505.04946 · cs.CV · May 9, 2025

Open Access

TL;DR

T2VTextBench is a new human-evaluation benchmark designed to assess the ability of text-to-video models to accurately generate on-screen text and maintain temporal consistency, revealing significant challenges in current systems.

Contribution

This paper introduces T2VTextBench, the first benchmark specifically targeting on-screen text fidelity and temporal consistency in text-to-video generation models.

Findings

1. Most models struggle to produce legible, consistent text
2. Current systems show significant gaps in textual manipulation capabilities
3. Benchmark provides a clear direction for future research

Abstract

Thanks to recent advancements in scalable deep architectures and large-scale pretraining, text-to-video generation has achieved unprecedented capabilities in producing high-fidelity, instruction-following content across a wide range of styles, enabling applications in advertising, entertainment, and education. However, these models' ability to render precise on-screen text, such as captions or mathematical formulas, remains largely untested, posing significant challenges for applications requiring exact textual accuracy. In this work, we introduce T2VTextBench, the first human-evaluation benchmark dedicated to evaluating on-screen text fidelity and temporal consistency in text-to-video models. Our suite of prompts integrates complex text strings with dynamic scene changes, testing each model's ability to maintain detailed instructions across frames. We evaluate ten state-of-the-art…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Multimodal Machine Learning Applications