Beyond Static Artifacts: A Forensic Benchmark for Video Deepfake Reasoning in Vision Language Models

Zheyuan Gu; Qingsong Zhao; Yusong Wang; Zhaohong Huang; Xinqi Li; Cheng Yuan; Jiaowei Shao; Chi Zhang; Xuelong Li

arXiv:2602.21779 · cs.CV · February 26, 2026

Open Access

TL;DR

This paper introduces FAQ (Forensic Answer-Questioning), a large-scale multiple-choice benchmark for evaluating and improving how well vision-language models detect and reason about temporal inconsistencies in deepfake videos, a dimension that current VLM-based detectors largely overlook.

Contribution

The paper presents FAQ, a three-level benchmark for temporal deepfake analysis, and reports that fine-tuning models on FAQ-IT improves their forensic reasoning.

Findings

01

Models fine-tuned on FAQ-IT outperform baseline models on detection benchmarks.

02

FAQ enables models to localize dynamic forgery artifacts across video frames.

03

Ablation studies confirm FAQ's effectiveness in improving temporal reasoning in VLMs.

Abstract

Current Vision-Language Models (VLMs) for deepfake detection excel at identifying spatial artifacts but overlook a critical dimension: temporal inconsistencies in video forgeries. Adapting VLMs to reason about these dynamic cues remains a distinct challenge. To bridge this gap, we propose Forensic Answer-Questioning (FAQ), a large-scale benchmark that formulates temporal deepfake analysis as a multiple-choice task. FAQ introduces a three-level hierarchy to progressively evaluate and equip VLMs with forensic capabilities: (1) Facial Perception, testing the ability to identify static visual artifacts; (2) Temporal Deepfake Grounding, requiring the localization of dynamic forgery artifacts across frames; and (3) Forensic Reasoning, challenging models to synthesize evidence for final authenticity verdicts. We evaluate a range of VLMs on FAQ and generate a corresponding instruction-tuning…
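Because the abstract frames FAQ as a multiple-choice task organized into three levels, results would most naturally be reported as accuracy per level. The snippet below is a minimal sketch of that bookkeeping, not the authors' evaluation code: the `MCQItem` record, its field names, the level identifiers, and the toy questions are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

# Level names paraphrase the abstract's hierarchy; the identifiers are hypothetical.
LEVELS = ("facial_perception", "temporal_grounding", "forensic_reasoning")


@dataclass(frozen=True)
class MCQItem:
    """One multiple-choice question (hypothetical schema, not the FAQ file format)."""
    level: str       # one of LEVELS
    question: str
    options: tuple   # answer choices shown to the model
    answer: int      # index of the correct option


def accuracy_by_level(items, predictions):
    """Return {level: fraction correct} for every level that has at least one item."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, predicted in zip(items, predictions):
        total[item.level] += 1
        correct[item.level] += int(predicted == item.answer)
    return {lvl: correct[lvl] / total[lvl] for lvl in LEVELS if total[lvl] > 0}


# Toy items, invented for illustration only.
toy_items = [
    MCQItem("facial_perception", "Which region looks blended?", ("eyes", "mouth"), 1),
    MCQItem("temporal_grounding", "Which span flickers?", ("0-10", "10-20"), 0),
    MCQItem("temporal_grounding", "Which span shows a jump?", ("0-10", "10-20"), 1),
    MCQItem("forensic_reasoning", "Is the video authentic?", ("real", "fake"), 1),
]
toy_predictions = [1, 0, 0, 0]  # option indices chosen by a hypothetical model
print(accuracy_by_level(toy_items, toy_predictions))
```

Keeping the three levels separate, rather than pooling them into a single score, makes it visible when a model handles static artifacts well but fails on temporal grounding, which is the gap the abstract highlights.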

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Digital Media Forensic Detection