Movie101v2: Improved Movie Narration Benchmark
Zihao Yue, Yepeng Zhang, Ziheng Wang, Qin Jin

TL;DR
This paper introduces Movie101v2, a large-scale bilingual dataset for automatic movie narration, proposes a staged evaluation framework, and benchmarks large vision-language models, revealing significant research challenges in generating detailed, plot-aware video descriptions.
Contribution
The paper presents Movie101v2, a new high-quality dataset and a staged evaluation framework for movie narration, along with baseline results using advanced vision-language models.
Findings
Large vision-language models struggle with complex movie narration tasks.
The new benchmark reveals significant gaps in current model capabilities.
Dividing the task into three progressive stages helps clarify research directions in movie narration.
Abstract
Automatic movie narration aims to generate video-aligned plot descriptions to assist visually impaired audiences. Unlike standard video captioning, it involves not only describing key visual details but also inferring plots that unfold across multiple movie shots, presenting distinct and complex challenges. To advance this field, we introduce Movie101v2, a large-scale, bilingual dataset with enhanced data quality specifically designed for movie narration. Revisiting the task, we propose breaking down the ultimate goal of automatic movie narration into three progressive stages, offering a clear roadmap with corresponding evaluation metrics. Based on our new benchmark, we baseline a range of large vision-language models, including GPT-4V, and conduct an in-depth analysis of the challenges in narration generation. Our findings highlight that achieving applicable movie narration generation…
Taxonomy
Topics: Multimedia Communication and Technology · Video Analysis and Summarization
Methods: ALIGN · Focus
