Unleashing Vision-Language Semantics for Deepfake Video Detection
Jiawen Zhu, Yunqi Miao, Xueyi Zhang, Jiankang Deng, Guansong Pang

TL;DR
VLAForge leverages cross-modal vision-language semantics and a novel ForgePerceiver to significantly improve deepfake video detection accuracy across various benchmarks by capturing subtle forgery cues and identity-specific authenticity signals.
Contribution
The paper introduces VLAForge, a framework that enhances vision-language models with a ForgePerceiver and identity-aware scoring to better detect deepfakes, leveraging rich semantics and identity cues.
Findings
Outperforms state-of-the-art methods on multiple benchmarks
Effective in detecting both face-swapping and full-face generation forgeries
Enhances discriminability by integrating cross-modal semantics with identity cues
Abstract
Recent Deepfake Video Detection (DFD) studies have demonstrated that pre-trained Vision-Language Models (VLMs) such as CLIP exhibit strong generalization capabilities in detecting artifacts across different identities. However, existing approaches focus on leveraging visual features only, overlooking their most distinctive strength -- the rich vision-language semantics embedded in the latent space. We propose VLAForge, a novel DFD framework that unleashes the potential of such cross-modal semantics to enhance the model's discriminability in deepfake detection. This work i) enhances the visual perception of the VLM through a ForgePerceiver, which acts as an independent learner to capture diverse, subtle forgery cues both granularly and holistically, while preserving the pretrained Vision-Language Alignment (VLA) knowledge, and ii) provides a complementary discriminative cue -- Identity-Aware VLA…
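To make the idea of vision-language semantics concrete, here is a minimal sketch of generic CLIP-style scoring: a frame embedding is compared against two text-prompt embeddings ("real" and "fake") by cosine similarity, and a softmax over the two similarities gives a fake probability. This is not VLAForge's ForgePerceiver or its identity-aware scoring, which the abstract does not specify in detail. The function names, the toy vectors, and the temperature value are all assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def fake_probability(image_emb, real_text_emb, fake_text_emb, temperature=0.07):
    """Softmax over image-text similarities; returns P(fake).

    Hypothetical helper: the embeddings are assumed to come from a
    vision-language model's image and text encoders, and the temperature
    is an assumed value, not one taken from the paper.
    """
    logits = [
        cosine(image_emb, real_text_emb) / temperature,
        cosine(image_emb, fake_text_emb) / temperature,
    ]
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)


# Toy 2-D embeddings standing in for real encoder outputs (made up).
real_prompt = [1.0, 0.0]
fake_prompt = [0.0, 1.0]

p_realistic = fake_probability([1.0, 0.0], real_prompt, fake_prompt)
p_ambiguous = fake_probability([1.0, 1.0], real_prompt, fake_prompt)
p_forged = fake_probability([0.1, 1.0], real_prompt, fake_prompt)
```

In this toy setup, a frame embedding aligned with the "real" prompt receives a fake probability near zero, one equidistant from both prompts receives 0.5, and one aligned with the "fake" prompt receives a value near one.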
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Multimodal Machine Learning Applications
