Movie101v2: Improved Movie Narration Benchmark

Zihao Yue; Yepeng Zhang; Ziheng Wang; Qin Jin

arXiv:2404.13370 · cs.CV · October 21, 2024 · 1 citation

TL;DR

This paper introduces Movie101v2, a large-scale bilingual dataset for automatic movie narration, proposes a staged evaluation framework, and benchmarks large vision-language models, revealing significant research challenges in generating detailed, plot-aware video descriptions.

Contribution

The paper presents Movie101v2, a new high-quality dataset and a staged evaluation framework for movie narration, along with baseline results using advanced vision-language models.

Findings

01 Large vision-language models struggle with complex movie narration tasks.

02 The new benchmark reveals significant gaps in current model capabilities.

03 Breaking the task into three progressive stages helps clarify research directions in movie narration.

Abstract

Automatic movie narration aims to generate video-aligned plot descriptions to assist visually impaired audiences. Unlike standard video captioning, it involves not only describing key visual details but also inferring plots that unfold across multiple movie shots, presenting distinct and complex challenges. To advance this field, we introduce Movie101v2, a large-scale, bilingual dataset with enhanced data quality specifically designed for movie narration. Revisiting the task, we propose breaking down the ultimate goal of automatic movie narration into three progressive stages, offering a clear roadmap with corresponding evaluation metrics. Based on our new benchmark, we baseline a range of large vision-language models, including GPT-4V, and conduct an in-depth analysis of the challenges in narration generation. Our findings highlight that achieving applicable movie narration generation…

Code & Models

Repositories

yuezih/movie101
pytorch

Datasets

yuezih/Movie101
dataset · 554 downloads

Taxonomy

Topics: Multimedia Communication and Technology · Video Analysis and Summarization

Methods: ALIGN · Focus