Is Your World Simulator a Good Story Presenter? A Consecutive Events-Based Benchmark for Future Long Video Generation

Yiping Wang; Xuehai He; Kuan Wang; Luyao Ma; Jianwei Yang; Shuohang Wang; Simon Shaolei Du; Yelong Shen

arXiv:2412.16211·cs.CV·December 24, 2024

Open Access

TL;DR

This paper introduces StoryEval, a new benchmark for evaluating text-to-video models' ability to coherently present sequential story events, revealing current models' limitations in story completion.

Contribution

The paper presents StoryEval, a novel story-oriented benchmark with 423 prompts, and demonstrates its effectiveness in assessing and highlighting the challenges faced by current T2V models.

Findings

1. None of the evaluated models exceeds a 50% story completion rate.

2. StoryEval's scores correlate well with human judgment.

3. Current models struggle to generate coherent long-video stories.
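
To make the completion-rate figure concrete, here is a minimal sketch of how such a score could be computed. This page does not spell out the formula, so the sketch assumes that each prompt lists several events, that some judge marks each event as completed or not, and that the score is the mean per-prompt fraction of completed events. The function name `story_completion_rate` and the sample judgments are hypothetical and are not taken from the paper.

```python
from statistics import mean


def story_completion_rate(judgments):
    """Mean per-prompt fraction of completed events.

    `judgments` holds one entry per prompt; each entry is a non-empty
    list of booleans, one per event in that prompt's story
    (True = the judge considered the event completed in the video).
    This scoring rule is an assumption made for illustration only.
    """
    if not judgments:
        raise ValueError("need at least one prompt")
    per_prompt = []
    for events in judgments:
        if not events:
            raise ValueError("every prompt needs at least one event")
        # sum() counts True values, so this is completed / total.
        per_prompt.append(sum(events) / len(events))
    return mean(per_prompt)


# Hypothetical judgments for three prompts (not data from the paper).
example = [
    [True, False, False],        # 1 of 3 events completed
    [True, True],                # 2 of 2 events completed
    [False, False, True, True],  # 2 of 4 events completed
]
rate = story_completion_rate(example)
```

Under this assumed rule, a model that completes only the first event of every multi-event story would score well below 50%, which is the kind of shortfall the first finding describes.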

Abstract

The current state-of-the-art video generative models can produce commercial-grade videos with highly realistic details. However, they still struggle to coherently present multiple sequential events in the stories specified by the prompts, which is foreseeably an essential capability for future long video generation scenarios. For example, top T2V generative models still fail to generate a video of the short, simple story 'how to put an elephant into a refrigerator.' While existing detail-oriented benchmarks primarily focus on fine-grained metrics like aesthetic quality and spatial-temporal consistency, they fall short of evaluating models' abilities to handle event-level story presentation. To address this gap, we introduce StoryEval, a story-oriented benchmark specifically designed to assess text-to-video (T2V) models' story-completion capabilities. StoryEval features 423 prompts…

Taxonomy

Topics: Artificial Intelligence in Games · Video Analysis and Summarization · Human Motion and Animation
