Movie Facts and Fibs (MF$^2$): A Benchmark for Long Movie Understanding
Emmanouil Zaranis, António Farinhas, Saul Santos, Beatriz Canaverde, Miguel Moura Ramos, Aditya K Surikuchi, André Viveiros, Baohao Liao, Elena Bueno-Benito, Nithin Sivakumaran, Pavlo Vasylenko, Shoubin Yu, Sonal Sannigrahi, Wafaa Mohammed, Ben Peters

TL;DR
MF$^2$ is a new benchmark for assessing models' ability to understand, recall, and reason about full-length movies by evaluating their performance on true-false claim pairs related to key narrative elements.
Contribution
Introduces MF$^2$, a comprehensive benchmark with manually curated claims for long movies, emphasizing deep understanding over superficial detail, and proposing a binary evaluation protocol.
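As a rough illustration of how a binary protocol over fact/fib pairs could be scored, the sketch below computes two numbers from a model's true/false judgments. It is a minimal sketch under stated assumptions, not the authors' evaluation code: the `ClaimPair` class, its field names, the `score` function, and the "pair accuracy" metric (credit only when both claims in a pair are judged correctly) are all hypothetical and not taken from the text above.

```python
from dataclasses import dataclass


@dataclass
class ClaimPair:
    """One fact/fib pair. Field names are hypothetical, for illustration only."""
    fact_pred: bool  # model's True/False judgment on the true claim (fact)
    fib_pred: bool   # model's True/False judgment on the false claim (fib)


def score(pairs: list[ClaimPair]) -> dict[str, float]:
    """Score binary judgments on fact/fib pairs.

    claim_accuracy: fraction of individual claims judged correctly.
    pair_accuracy:  fraction of pairs where BOTH claims are judged correctly
                    (an assumed metric, not confirmed by the source text).
    """
    if not pairs:
        raise ValueError("need at least one claim pair")
    fact_ok = [p.fact_pred is True for p in pairs]   # fact should be accepted
    fib_ok = [p.fib_pred is False for p in pairs]    # fib should be rejected
    n = len(pairs)
    return {
        "claim_accuracy": (sum(fact_ok) + sum(fib_ok)) / (2 * n),
        "pair_accuracy": sum(a and b for a, b in zip(fact_ok, fib_ok)) / n,
    }


# Toy predictions (made up for the example, not benchmark data).
preds = [
    ClaimPair(fact_pred=True, fib_pred=False),   # both judged correctly
    ClaimPair(fact_pred=True, fib_pred=True),    # accepts everything
    ClaimPair(fact_pred=False, fib_pred=False),  # rejects everything
    ClaimPair(fact_pred=True, fib_pred=False),   # both judged correctly
]
result = score(preds)
print(result)
```

In this toy case, a model that always answers "true" or always answers "false" still gets half of the individual claims right, but earns no pair-level credit, which is one reason a paired design can be stricter than per-claim accuracy.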
Findings
State-of-the-art models perform poorly compared to humans.
The benchmark reveals current models' limitations in narrative comprehension.
Humans significantly outperform models in recalling and reasoning about movie content.
Abstract
Despite recent progress in vision-language models (VLMs), holistic understanding of long-form video content remains a significant challenge, partly due to limitations in current benchmarks. Many focus on peripheral, "needle-in-a-haystack" details, encouraging context-insensitive retrieval over deep comprehension. Others rely on large-scale, semi-automatically generated questions (often produced by language models themselves) that are easier for models to answer but fail to reflect genuine understanding. In this paper, we introduce MF$^2$, a new benchmark for evaluating whether models can comprehend, consolidate, and recall key narrative information from full-length movies (50-170 minutes long). MF$^2$ includes over 50 full-length, open-licensed movies, each paired with manually constructed sets of claim pairs -- one true (fact) and one plausible but false (fib), totalling over 850…
Peer Reviews
Decision·Submitted to ICLR 2026
1. MF2 introduces a new evaluation paradigm for long-form narrative video understanding, moving beyond prior benchmarks that focus on short-term visual recall or detail-oriented retrieval. It provides a more realistic and cognitively demanding test of narrative comprehension. 2. All samples are manually constructed by the research team after watching entire movies, ensuring annotation quality and consistency. The dataset covers multiple reasoning granularities (single-scene, multi-scene, global
- **Overstated Novelty Compared to Existing Long-Video Benchmarks:** While MF2 leverages full-length movies to evaluate long-term reasoning, there already exist several general long-video understanding benchmarks. The authors’ claim that prior works focus only on "peripheral or low-level details" and lack "abstractive understanding of the central storyline" is somewhat overstated. For instance, HourVideo also involves hour-long videos and tasks such as causal and counterfactual reasoning, which
- Benchmark datasets are valuable assets for our community. The authors commit to releasing the full dataset, code, and movies, ensuring reproducibility and supporting future research. - Movies are long-form videos that are mostly self-contained stories. MF² focuses on holistic narrative understanding of full-length movies, requiring models to reason about the story's core elements, unlike previous benchmarks that emphasize "needle-in-a-haystack" details. This work seems to pioneer a new area to deal
- While minimal-edit fibs are effective and annotators filter ambiguous cases, some may be too obvious or, conversely, too subtle, potentially confusing both humans and models. - Considering the previous benchmarks, this reviewer does not intend to challenge the current configuration. While the authors directly compare the values from the models and humans presented in Table 3, humans would need to watch the video and subtitles without sound for a fair comparison. - For multi-scene cases, the ran
The benchmark's use of open-licensed content and human-annotated claims ensures reproducibility and high-quality labels, addressing gaps in existing datasets prone to copyright issues or automated generation. The contrastive claim design and granular categorization (e.g., single-scene to global reasoning) enable a nuanced assessment of narrative comprehension, with human baselines highlighting the task's feasibility for humans but difficulty for models.
1. The movies in MF2 are exclusively from 1920–1970 (to avoid data contamination), lacking modern films with contemporary narrative styles, visual effects, or cultural contexts. This limits the generalization of results to real-world scenarios involving recent long videos (e.g., modern films, documentaries). It is suggested that the authors discuss this limitation. 2. All annotators are co-authors of the paper, rather than independent external annotators. This may introduce subjective biases in
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling
