INFACT: A Diagnostic Benchmark for Induced Faithfulness and Factuality Hallucinations in Video-LLMs
Junqi Yang, Yuecong Min, Jie Zhang, Shiguang Shan, Xilin Chen

TL;DR
This paper introduces INFACT, a diagnostic benchmark for faithfulness and factuality hallucinations in Video-LLMs, and shows that current models often lack robustness under induced conditions such as visual degradation, evidence corruption, and temporal intervention.
Contribution
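The paper presents INFACT, a new diagnostic benchmark with fine-grained taxonomies for faithfulness and factuality in Video-LLMs. Beyond the clean Base mode, it evaluates models under three induced modes (Visual Degradation, Evidence Corruption, and Temporal Intervention) and reports two reliability metrics, Resist Rate (RR) and Temporal Sensitivity Score (TSS).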
Findings
Higher base accuracy does not ensure reliability in induced modes.
Evidence corruption significantly reduces model stability.
Many open-source models show minimal temporal sensitivity on factuality questions.
Abstract
Despite rapid progress, Video Large Language Models (Video-LLMs) remain unreliable due to hallucinations, which are outputs that contradict either video evidence (faithfulness) or verifiable world knowledge (factuality). Existing benchmarks provide limited coverage of factuality hallucinations and predominantly evaluate models only in clean settings. We introduce INFACT, a diagnostic benchmark comprising 9,800 QA instances with fine-grained taxonomies for faithfulness and factuality, spanning real and synthetic videos. INFACT evaluates models in four modes: Base (clean), Visual Degradation, Evidence Corruption, and Temporal Intervention for order-sensitive items. Reliability under induced modes is quantified using Resist Rate (RR) and Temporal Sensitivity Score (TSS). Experiments on 14 representative Video-LLMs reveal that higher Base-mode accuracy does not reliably…
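The abstract names Resist Rate (RR) but this summary does not give its formula. As a rough illustration only, the sketch below assumes one plausible reading: RR is the fraction of items answered correctly in Base mode that are still answered correctly under an induced mode. The function names, that definition, and the toy per-item results are all hypothetical and are not taken from the paper.

```python
from typing import Sequence


def accuracy(correct: Sequence[bool]) -> float:
    """Fraction of items answered correctly (0.0 for an empty list)."""
    return sum(correct) / len(correct) if correct else 0.0


def resist_rate(base_correct: Sequence[bool],
                induced_correct: Sequence[bool]) -> float:
    """ASSUMED definition, not the paper's formula: among items that are
    correct in Base mode, the share that stay correct under an induced mode."""
    if len(base_correct) != len(induced_correct):
        raise ValueError("both modes must cover the same items")
    # Keep only the induced-mode outcomes for items that were right in Base.
    kept = [ind for base, ind in zip(base_correct, induced_correct) if base]
    return sum(kept) / len(kept) if kept else 0.0


# Toy, made-up per-item outcomes for two hypothetical models.
model_a_base = [True, True, True, True, False]
model_a_induced = [True, False, True, False, False]
model_b_base = [True, True, True, False, False]
model_b_induced = [True, True, True, False, False]

for name, base, induced in [
    ("model A", model_a_base, model_a_induced),
    ("model B", model_b_base, model_b_induced),
]:
    print(name, "base acc:", accuracy(base), "RR:", resist_rate(base, induced))
```

In this toy data the model with the higher Base accuracy has the lower assumed RR, which is the kind of gap the Findings describe. Conditioning on Base-correct items is a design choice of this sketch: it separates robustness from raw accuracy, but other definitions are possible.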
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Explainable Artificial Intelligence (XAI)
