Beyond Static Artifacts: A Forensic Benchmark for Video Deepfake Reasoning in Vision Language Models
Zheyuan Gu, Qingsong Zhao, Yusong Wang, Zhaohong Huang, Xinqi Li, Cheng Yuan, Jiaowei Shao, Chi Zhang, Xuelong Li

TL;DR
This paper introduces FAQ, a large-scale multiple-choice benchmark for evaluating and improving vision-language models' ability to detect and reason about temporal inconsistencies in deepfake videos, a dimension that current VLM-based detectors, which focus on spatial artifacts, largely overlook.
Contribution
The paper presents FAQ, a three-level benchmark for temporal deepfake analysis, and demonstrates that fine-tuning models on FAQ-IT enhances their forensic reasoning capabilities.
Findings
Models fine-tuned on FAQ-IT outperform baseline models on detection benchmarks.
FAQ enables models to localize dynamic forgery artifacts across video frames.
Ablation studies confirm FAQ's effectiveness in improving temporal reasoning in VLMs.
Abstract
Current Vision-Language Models (VLMs) for deepfake detection excel at identifying spatial artifacts but overlook a critical dimension: temporal inconsistencies in video forgeries. Adapting VLMs to reason about these dynamic cues remains a distinct challenge. To bridge this gap, we propose Forensic Answer-Questioning (FAQ), a large-scale benchmark that formulates temporal deepfake analysis as a multiple-choice task. FAQ introduces a three-level hierarchy to progressively evaluate and equip VLMs with forensic capabilities: (1) Facial Perception, testing the ability to identify static visual artifacts; (2) Temporal Deepfake Grounding, requiring the localization of dynamic forgery artifacts across frames; and (3) Forensic Reasoning, challenging models to synthesize evidence for final authenticity verdicts. We evaluate a range of VLMs on FAQ and generate a corresponding instruction-tuning…
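To make the multiple-choice framing concrete, the following is a minimal sketch of how items from the three levels could be represented and scored per level. It is not the authors' data format or evaluation code: the `Level`, `Item`, and `accuracy_by_level` names, the field layout, and the sample questions are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    """The three levels named in the abstract (enum names are assumptions)."""
    FACIAL_PERCEPTION = 1
    TEMPORAL_GROUNDING = 2
    FORENSIC_REASONING = 3


@dataclass(frozen=True)
class Item:
    """One hypothetical multiple-choice question about a video."""
    video_id: str
    level: Level
    question: str
    options: tuple
    answer: int  # index into `options` of the correct choice


def accuracy_by_level(items, predictions):
    """Return the fraction of correctly answered items for each level."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, predicted in zip(items, predictions):
        total[item.level] += 1
        correct[item.level] += int(predicted == item.answer)
    return {level: correct[level] / total[level] for level in total}


# Illustrative items only; these are not taken from the benchmark.
items = [
    Item("v1", Level.FACIAL_PERCEPTION,
         "Which facial region shows a visual artifact?",
         ("mouth", "eyes", "none"), 0),
    Item("v1", Level.TEMPORAL_GROUNDING,
         "In which frame range does the artifact appear?",
         ("0-10", "10-20", "20-30"), 2),
    Item("v2", Level.TEMPORAL_GROUNDING,
         "In which frame range does the artifact appear?",
         ("0-10", "10-20", "20-30"), 1),
    Item("v2", Level.FORENSIC_REASONING,
         "Is the video authentic?",
         ("fake", "real"), 0),
]
predictions = [0, 1, 1, 0]  # option indices chosen by a hypothetical model
scores = accuracy_by_level(items, predictions)
```

Reporting accuracy per level rather than as one aggregate number keeps the three capabilities separable, so a model that perceives static artifacts well but fails at temporal grounding is not masked by its stronger level.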
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Digital Media Forensic Detection
