StoryBench: A Multifaceted Benchmark for Continuous Story Visualization
Emanuele Bugliarello, Hernan Moraldo, Ruben Villegas, Mohammad Babaeizadeh, Mohammad Taghi Saffar, Han Zhang, Dumitru Erhan, Vittorio Ferrari, Pieter-Jan Kindermans, Paul Voigtlaender

TL;DR
StoryBench is a multi-task benchmark designed to evaluate and advance text-to-video generation models. It assesses the realism, consistency, and prompt adherence of generated videos, with an emphasis on human assessment.
Contribution
The paper introduces a new multi-task benchmark with annotated datasets for evaluating text-to-video models, provides guidelines for human evaluation, and offers insights into automatic metrics.
Findings
Training on story-like data improves model performance.
Current models struggle with complex story generation tasks.
Guidelines help standardize human evaluation of video stories.
Abstract
Generating video stories from text prompts is a complex task. In addition to having high visual quality, videos need to realistically adhere to a sequence of text prompts whilst being consistent throughout the frames. Creating a benchmark for video generation requires data annotated over time, which contrasts with the single caption used often in video datasets. To fill this gap, we collect comprehensive human annotations on three existing datasets, and introduce StoryBench: a new, challenging multi-task benchmark to reliably evaluate forthcoming text-to-video models. Our benchmark includes three video generation tasks of increasing difficulty: action execution, where the next action must be generated starting from a conditioning video; story continuation, where a sequence of actions must be executed starting from a conditioning video; and story generation, where a video must be…
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition
