LoCoT2V-Bench: Benchmarking Long-Form and Complex Text-to-Video Generation
Xiangqing Zheng, Chengyue Wu, Kehai Chen, Min Zhang

TL;DR
LoCoT2V-Bench introduces a benchmark and evaluation framework for long-form text-to-video generation under complex prompts. Current models show strong perceptual quality and background consistency but weak fine-grained text-video alignment and character consistency.
Contribution
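The paper presents LoCoT2V-Bench, a benchmark of multi-scene prompts with hierarchical metadata for long video generation, and LoCoT2V-Eval, a multi-dimensional evaluation framework covering perceptual quality, text-video alignment, temporal quality, dynamic quality, and Human Expectation Realization Degree (HERD).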
Findings
Models excel in perceptual quality and background consistency.
Fine-grained text-video alignment is weak across models.
Character consistency remains a significant challenge.
Abstract
Recent advances in text-to-video generation have achieved impressive performance on short clips, yet evaluating long-form generation under complex textual inputs remains a significant challenge. In response to this challenge, we present LoCoT2V-Bench, a benchmark for long video generation (LVG) featuring multi-scene prompts with hierarchical metadata (e.g., character settings and camera behaviors), constructed from collected real-world videos. We further propose LoCoT2V-Eval, a multi-dimensional framework covering perceptual quality, text-video alignment, temporal quality, dynamic quality, and Human Expectation Realization Degree (HERD), with an emphasis on aspects such as fine-grained text-video alignment and temporal character consistency. Experiments on 13 representative LVG models reveal pronounced capability disparities across evaluation dimensions, with strong perceptual quality…
