NOAH: Benchmarking Narrative Prior driven Hallucination and Omission in Video Large Language Models
Kyuho Lee, Euntae Kim, Jinwoo Choi, and Buru Chang

TL;DR
This paper introduces NOAH, a benchmark for evaluating how narrative priors in Video LLMs cause hallucinations and omissions. It shows that these errors vary across models and conditions and are amplified when fewer frames are sampled.
Contribution
The paper presents NOAH, a large-scale benchmark for systematically analyzing narrative prior-induced hallucinations and omissions in Video LLMs, enabling controlled evaluation and comparison.
Findings
Most Video LLMs exhibit hallucinations and omissions due to narrative priors.
Error patterns vary across architectures and depend on event similarity and insertion position.
Fewer frames increase reliance on narrative priors, amplifying errors.
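As a rough illustration of why frame count could matter, the sketch below counts how many uniformly sampled frames fall on an inserted clip. It is not the paper's sampling procedure: the function names, the segment-center sampling rule, and the frame numbers are all assumptions chosen for illustration.

```python
def uniform_sample_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick frame indices at the centers of equal-width segments (assumed sampling rule)."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def frames_on_insert(indices: list[int], insert_start: int, insert_end: int) -> int:
    """Count sampled indices inside the half-open frame range [insert_start, insert_end)."""
    return sum(insert_start <= i < insert_end for i in indices)


if __name__ == "__main__":
    # Hypothetical composite: 900 frames, with the inserted clip at frames 400..499.
    for budget in (32, 8):
        sampled = uniform_sample_indices(900, budget)
        print(budget, frames_on_insert(sampled, 400, 500))
```

With these hypothetical numbers, a 32-frame budget places 4 frames on the inserted clip, while an 8-frame budget places none. In that case a model has no visual evidence of the inserted event and can only fall back on the surrounding storyline.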
Abstract
Video large language models (Video LLMs) have recently achieved strong performance on tasks such as captioning, summarization, and question answering. Many models and training methods explicitly encourage continuity across events to enhance narrative coherence. While this improves fluency, it also introduces an inductive bias that prioritizes storyline consistency over strict grounding in visual evidence. We identify this bias, which we call narrative prior, as a key driver of two errors: hallucinations, where non-existent events are introduced or existing ones are misinterpreted, and omissions, where factual events are suppressed because they are misaligned with surrounding context. To systematically evaluate narrative prior-induced errors, we introduce NOAH, a large-scale benchmark that constructs composite videos by inserting clips from other sources into target videos. By varying…
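The composite construction described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' pipeline: the `Clip` type, the `build_composite` and `inserted_span` helpers, and the `"start"`/`"middle"`/`"end"` position names are hypothetical, and the real benchmark operates on video files rather than clip metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    """A segment of a source video (hypothetical representation)."""
    source_id: str
    start_s: float
    end_s: float

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s


def build_composite(target: list[Clip], inserted: Clip, position: str) -> list[Clip]:
    """Return a new clip sequence with `inserted` placed at an assumed position slot."""
    slots = {"start": 0, "middle": len(target) // 2, "end": len(target)}
    if position not in slots:
        raise ValueError(f"unknown position: {position!r}")
    idx = slots[position]
    return target[:idx] + [inserted] + target[idx:]


def inserted_span(composite: list[Clip], inserted: Clip) -> tuple[float, float]:
    """Locate the inserted clip on the composite timeline, in seconds."""
    idx = composite.index(inserted)
    start = sum(c.duration_s for c in composite[:idx])
    return (start, start + inserted.duration_s)


if __name__ == "__main__":
    # Four 5-second target clips plus a 3-second clip from another source.
    target = [Clip("target", 5.0 * i, 5.0 * (i + 1)) for i in range(4)]
    foreign = Clip("other", 3.0, 6.0)
    composite = build_composite(target, foreign, "middle")
    print(inserted_span(composite, foreign))
```

Keeping the inserted clip's span on the composite timeline gives a ground-truth reference. A caption that skips that span would count as an omission, and one that describes events absent from every clip would count as a hallucination.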
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
1. NOAH introduces a novel benchmark to evaluate MLLM hallucinations and omissions, which is critical for video understanding.
2. The paper benchmarks 15 MLLMs, featuring extensive analysis of narrative prior placement and difficulty levels.
1. Insufficient Combination Granularity: Relying on only nine composite types provides a coarse benchmark, making it difficult to sufficiently analyze the granular conditions under which hallucinations and omissions occur.
2. Limited Temporal Scope: The benchmark's limited time range (video duration) prevents exploration of how sequence length interacts with and potentially affects the model's utilization of narrative priors.
3. Inadequacy of CLIP Score: The reliance on CLIP score is a limitati
1. The problem perspective is novel: for the first time, the hallucination and omission problems of Video LLMs are systematically attributed to a fundamental bias, namely the "narrative prior".
2. The paper's evaluation design is comprehensive, with one captioning task and three QA tasks that evaluate the model along multiple dimensions: Existence, Temporal, and Narrative.
1. This study bases its evaluation on another LLM system that itself has inherent hallucination and omission flaws, so its evaluation criteria cannot be precisely verified.
2. The generation method for composite videos is unnatural. The model's errors might reflect its lack of robustness to clipping behavior, rather than a true "narrative prior".
3. When the model omits an inserted clip on NOAH, the paper interprets this as the model actively ignoring it to maintain narrative consist
- This paper proposes a large-scale video-QA benchmark with 9k synthetically composed videos.
- The benchmark offers diverse (8 unique) metrics from all possible failure scenarios.
- RQ2 and RQ3 in Section 4.3 provide new findings to the community.
- I didn’t fully understand the concept of narrative prior-induced hallucinations. The authors' definition of narrative prior can be rephrased as: “inductive bias of Video LLMs to generate contextually accurate captions grounded by visual evidence”. Does this mean that Video LLMs have a bias to ignore contextually irrelevant frames or clips of the video? The paper does not explain why models have such a bias, and assumes the prior exists without a proof of concept. Meanwhile, it would
• This is the first benchmark that explicitly investigates the impact of narrative priors in Video LLMs, which is a valuable contribution to the research community.
• The benchmark is technically sound and well-constructed.
• My main concern is that the paper’s contributions are limited. As reported in lines 315–317, all methods already show high hallucination and omission rates on the original data, so it is expected that they would perform similarly on videos with inserted events. From my viewpoint, the main findings are restricted to (i) a method that performs better than others on hallucination and omission may behave differently on the new benchmark, and (ii) how the temporal position of inserted events affects final
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling
