Is Your World Simulator a Good Story Presenter? A Consecutive Events-Based Benchmark for Future Long Video Generation
Yiping Wang, Xuehai He, Kuan Wang, Luyao Ma, Jianwei Yang, Shuohang Wang, Simon Shaolei Du, Yelong Shen

TL;DR
This paper introduces StoryEval, a new benchmark for evaluating text-to-video models' ability to coherently present sequential story events, revealing current models' limitations in story completion.
Contribution
The paper presents StoryEval, a novel story-oriented benchmark with 423 prompts, and demonstrates its effectiveness in assessing and highlighting the challenges faced by current T2V models.
Findings
None of the evaluated models surpasses a 50% story completion rate.
StoryEval effectively correlates with human judgment.
Current models struggle with coherent long video story generation.
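To make the "story completion rate" figure concrete, here is a minimal sketch of how such a rate could be computed. It assumes, purely for illustration, that each prompt is broken into an ordered list of events, that some verifier has already marked each event as completed or not, and that the overall rate is the mean of the per-prompt fractions. The function name `story_completion_rate` and the input layout are hypothetical; the paper's exact definition may differ.

```python
from statistics import mean


def story_completion_rate(event_results):
    """Average fraction of completed events across prompts.

    event_results: list of lists of bools, one inner list per prompt,
    where each bool says whether that event was judged as completed.
    This layout is an assumption for illustration, not the paper's format.
    """
    # Fraction of completed events for each prompt (skip prompts with no events).
    per_prompt = [sum(events) / len(events) for events in event_results if events]
    if not per_prompt:
        raise ValueError("no prompts with events to score")
    return mean(per_prompt)


# Hypothetical verdicts for three prompts with 2, 4, and 2 events.
verdicts = [
    [True, False],
    [True, True, True, False],
    [False, False],
]
rate = story_completion_rate(verdicts)
print(f"completion rate: {rate:.3f}")
```

Under this assumed definition, a model that completes only some events in each story can easily land below the 50% mark even when individual clips look realistic.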
Abstract
The current state-of-the-art video generative models can produce commercial-grade videos with highly realistic details. However, they still struggle to coherently present multiple sequential events in the stories specified by the prompts, which is foreseeably an essential capability for future long video generation scenarios. For example, top T2V generative models still fail to generate a video of the short simple story 'how to put an elephant into a refrigerator.' While existing detail-oriented benchmarks primarily focus on fine-grained metrics like aesthetic quality and spatial-temporal consistency, they fall short of evaluating models' abilities to handle event-level story presentation. To address this gap, we introduce StoryEval, a story-oriented benchmark specifically designed to assess text-to-video (T2V) models' story-completion capabilities. StoryEval features 423 prompts…
Taxonomy
Topics: Artificial Intelligence in Games · Video Analysis and Summarization · Human Motion and Animation
