Narrative Aligned Long Form Video Question Answering
Rahul Jain, Keval Doshi, Burak Uzkent, Garin Kessler

TL;DR
NA-VQA is a new benchmark for evaluating deep narrative reasoning in long videos; it tests understanding of story structure rather than reliance on shallow cues. The paper also introduces Video-NaRA, a framework that improves reasoning by modeling event chains.
Contribution
The paper introduces NA-VQA, a comprehensive long-video reasoning benchmark, and proposes Video-NaRA, a novel narrative-centric model that improves long-range reasoning capabilities.
Findings
State-of-the-art models perform poorly on far-range evidence questions.
Video-NaRA improves reasoning performance by up to 3%.
NA-VQA enables evaluation of narrative reasoning in long videos.
Abstract
Recent progress in multimodal large language models (MLLMs) has led to a surge of benchmarks for long-video reasoning. However, most existing benchmarks rely on localized cues and fail to capture narrative reasoning, the ability to track intentions, connect distant events, and reconstruct causal chains across an entire movie. We introduce NA-VQA, a benchmark designed to evaluate deep temporal and narrative reasoning in long-form videos. NA-VQA contains 88 full-length movies and 4.4K open-ended question-answer pairs, each grounded in multiple evidence spans labeled as Short, Medium, or Far to assess long-range dependencies. By requiring generative, multi-scene answers, NA-VQA tests whether models can integrate dispersed narrative information rather than rely on shallow pattern matching. To address the limitations of existing approaches, we propose Video-NaRA, a narrative-centric…
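To make the evidence-range labeling concrete, here is a minimal Python sketch of how a question-answer record and a per-range score breakdown might be represented. It is an illustration under assumptions, not the benchmark's actual schema: the names `EvidenceRange`, `EvidenceSpan`, `QAItem`, and `mean_score_by_range` are hypothetical, each question is assumed to carry a single range label, spans are assumed to be stored as start and end times in seconds, and the per-answer scores are assumed to come from some external grader of the open-ended answers.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from enum import Enum


class EvidenceRange(Enum):
    """Range labels named in the abstract; how they are assigned is not specified here."""
    SHORT = "Short"
    MEDIUM = "Medium"
    FAR = "Far"


@dataclass
class EvidenceSpan:
    """One supporting segment of the movie (assumed: seconds from the start)."""
    start_s: float
    end_s: float


@dataclass
class QAItem:
    """Hypothetical record for one open-ended question-answer pair."""
    movie_id: str
    question: str
    answer: str
    evidence_range: EvidenceRange
    evidence: list[EvidenceSpan] = field(default_factory=list)


def mean_score_by_range(items, scores):
    """Average externally supplied per-answer scores within each range label."""
    if len(items) != len(scores):
        raise ValueError("items and scores must have the same length")
    buckets = defaultdict(list)
    for item, score in zip(items, scores):
        buckets[item.evidence_range].append(score)
    # Only labels that actually occur in `items` appear in the result.
    return {label: sum(vals) / len(vals) for label, vals in buckets.items()}
```

Reporting scores per range label, rather than as a single average, is what would let a gap on Far questions show up separately from performance on Short ones.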
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Generative Adversarial Networks and Image Synthesis
