NarrLV: Towards a Comprehensive Narrative-Centric Evaluation for Long Video Generation
X. Feng, H. Yu, M. Wu, S. Hu, J. Chen, C. Zhu, J. Wu, X. Chu, K. Huang

TL;DR
NarrLV introduces a comprehensive benchmark for evaluating the narrative expression capabilities of long video generation models, inspired by film narrative theory, using novel metrics and prompt generation methods.
Contribution
This paper presents the first benchmark specifically designed to evaluate narrative content in long video generation, incorporating novel metrics and an automatic prompt generation pipeline.
Findings
The proposed metric aligns closely with human judgments.
Current models have limited capabilities in narrative content expression.
NarrLV reveals detailed capability boundaries of existing models.
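The source does not say here how alignment with human judgments is measured. One common way to check such a claim is a rank correlation between metric scores and human ratings. The sketch below computes Spearman's rho in plain Python. The function names and the data are hypothetical and are not taken from the paper.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-video scores, NOT results from the paper.
metric_scores = [0.82, 0.41, 0.67, 0.15, 0.93]
human_ratings = [4.5, 2.0, 3.5, 1.0, 5.0]
rho = spearman(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.3f}")
```

A rho near 1 means the metric orders videos almost the same way human raters do. It says nothing about whether the absolute score values agree.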
Abstract
With the rapid development of foundation video generation technologies, long video generation models have exhibited promising research potential thanks to expanded content creation space. Recent studies reveal that the goal of long video generation tasks is not only to extend video duration but also to accurately express richer narrative content within longer videos. However, due to the lack of evaluation benchmarks specifically designed for long video generation models, the current assessment of these models primarily relies on benchmarks with simple narrative prompts (e.g., VBench). To the best of our knowledge, our proposed NarrLV is the first benchmark to comprehensively evaluate the Narrative expression capabilities of Long Video generation models. Inspired by film narrative theory, (i) we first introduce the basic narrative unit maintaining continuous visual presentation in videos…
Peer Reviews
Decision: ICLR 2026 Poster
- This paper serves a complementary role to VBench. As mentioned by the authors, the VBench series focuses on quality metrics such as controllability, commonsense, and physical plausibility, but notably does not cover narrative metrics.
- The definition of TNA is novel and the proposed metrics are both reasonable and appropriate. Inspired by film narrative theory, the authors define TNA as the basic elements of video content. Building upon TNA, the three metrics quantitatively measure diversity,
- This paper does not evaluate more advanced, closed-source models, such as Sora2 and Kling. However, given the significant cost and access constraints associated with these models, this omission is understandable and acceptable for an academic research paper.
- While Section 3.2 describes the prompt generation pipeline in detail, it lacks a sufficient analysis of the resulting prompts. For example, it is recommended that the authors conduct a statistical analysis of the theme categories within t
- The problem of evaluating the narrative capabilities of long videos is an important and unresolved challenge in the current field.
- The authors' attempt to construct a systematic, automated evaluation framework is a direction worth exploring.
- The experimental results show a strong correlation with human preferences, which increases the benchmark's credibility as a proxy for human evaluation.
**1. Significant Omission of Long Generation Method**
First, I question the rationale for evaluating base models like WAN. on a "long video" benchmark. These models are fundamentally designed to generate short video clips. Evaluating them on tasks far outside their intended design (i.e., long video, which I would argue implies a duration of several minutes) does not seem to yield meaningful insights and may be an unfair comparison. My primary criticism is the benchmark's almost exclusive focus
- Inspired by film narrative theory, the novel benchmark, NarrLV, defines the smallest narrative unit as the Temporal Narrative Atom (TNA), which serves as a quantitative measure of narrative richness in generated video. It further identifies three key dimensions that influence the TNA: scene attribute, target attribute, and target action. NarrLV contains a prompt suite that can flexibly generate prompts with a desired number of TNAs.
- For evaluation, this work follows a progressive narrative express
- W1: I acknowledge that the narrative prompt suite is highly valuable. Lines 190-192 describe the factors that influence the number of TNAs, and these are similar to those in TC-Bench. Can the authors articulate the differentiating factor between TC-Bench and NarrLV? Specifically, I would like to understand what novelty in NarrLV enables it to extend to prompts with higher TNA counts.
- W2: I acknowledge that the proposed benchmark suite and evaluation focus on the narrative expressive
Taxonomy
Topics: Cinema and Media Studies · Multimedia Communication and Technology · Video Analysis and Summarization
