NarrativeTrack: Evaluating Entity-Centric Reasoning for Narrative Understanding
Hyeonjeong Ha, Jinjin Ge, Bo Feng, Kaixin Ma, Gargi Chakraborty

TL;DR
NarrativeTrack introduces a benchmark and evaluation framework for testing how well multimodal large language models track and reason about entities in dynamic video narratives, and shows that current models struggle to keep entity identities consistent across visual transitions and context shifts.
Contribution
It presents the first benchmark for entity-centric narrative understanding in videos, together with a structured evaluation framework, the Compositional Reasoning Progression (CRP), that progressively increases narrative complexity.
Findings
State-of-the-art models struggle with entity tracking across visual transitions.
Models show a trade-off between perceptual grounding and temporal reasoning.
Current models often hallucinate entity identities under context shifts.
Abstract
Multimodal large language models (MLLMs) have achieved impressive progress in vision-language reasoning, yet their ability to understand temporally unfolding narratives in videos remains underexplored. True narrative understanding requires grounding who is doing what, when, and where, maintaining coherent entity representations across dynamic visual and temporal contexts. We introduce NarrativeTrack, the first benchmark to evaluate narrative understanding in MLLMs through fine-grained entity-centric reasoning. Unlike existing benchmarks limited to short clips or coarse scene-level semantics, we decompose videos into constituent entities and examine their continuity via a Compositional Reasoning Progression (CRP), a structured evaluation framework that progressively increases narrative complexity across three dimensions: entity existence, entity changes, and entity ambiguity. CRP…
