Movie2Story: A framework for understanding videos and telling stories in the form of novel text
Kangning Li, Zheyang Jia, Anyu Ying

TL;DR
This paper introduces MSBench, a new benchmark for evaluating multi-modal story generation from videos with rich auxiliary data, revealing current models' limitations and proposing improvements.
Contribution
It presents a novel benchmark and dataset generation method for assessing multi-modal story generation, along with a new model architecture to enhance performance.
Findings
Current models perform poorly on the benchmark.
Automated dataset creation reduces manual effort.
Proposed model shows improved results on MSBench.
Abstract
In recent years, large-scale models have achieved significant advancements, accompanied by the emergence of numerous high-quality benchmarks for evaluating various aspects of their comprehension abilities. However, most existing benchmarks primarily focus on spatial understanding in static image tasks. While some benchmarks extend evaluations to temporal tasks, they fall short in assessing text generation under complex contexts involving long videos and rich auxiliary information. To address this limitation, we propose a novel benchmark: the Multi-modal Story Generation Benchmark (MSBench), designed to evaluate text generation capabilities in scenarios enriched with auxiliary information. Our work introduces an innovative automatic dataset generation method to ensure the availability of accurate auxiliary information. On one hand, we leverage existing datasets and apply automated…
Taxonomy
Topics: Video Analysis and Summarization
Methods: Focus
