Unleashing Vision-Language Semantics for Deepfake Video Detection

Jiawen Zhu; Yunqi Miao; Xueyi Zhang; Jiankang Deng; Guansong Pang

arXiv:2603.24454·cs.CV·March 26, 2026

Open Access

TL;DR

VLAForge leverages cross-modal vision-language semantics and a novel ForgePerceiver to significantly improve deepfake video detection accuracy across various benchmarks by capturing subtle forgery cues and identity-specific authenticity signals.

Contribution

The paper introduces VLAForge, a framework that enhances vision-language models with a ForgePerceiver and identity-aware scoring to better detect deepfakes, leveraging rich semantics and identity cues.

Findings

1. Outperforms state-of-the-art methods on multiple benchmarks

2. Effective in detecting both face-swapping and full-face generation forgeries

3. Enhances discriminability by integrating cross-modal semantics with identity cues

Abstract

Recent Deepfake Video Detection (DFD) studies have demonstrated that pre-trained Vision-Language Models (VLMs) such as CLIP generalize well when detecting artifacts across different identities. However, existing approaches leverage visual features only, overlooking the most distinctive strength of VLMs: the rich vision-language semantics embedded in the latent space. We propose VLAForge, a novel DFD framework that unleashes the potential of such cross-modal semantics to enhance the model's discriminability in deepfake detection. This work i) enhances the visual perception of the VLM through a ForgePerceiver, which acts as an independent learner to capture diverse, subtle forgery cues both granularly and holistically, while preserving the pre-trained Vision-Language Alignment (VLA) knowledge, and ii) provides a complementary discriminative cue: Identity-Aware VLA…
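To make "vision-language semantics" concrete, the sketch below shows the generic way a CLIP-style model can score a frame against text prompts: compare the frame embedding with a "real" prompt embedding and a "fake" prompt embedding by cosine similarity, apply a softmax, and average over frames. This is only an illustration of vision-language alignment scoring in general, not the VLAForge method; the ForgePerceiver and the identity-aware scoring are not reproduced here. The function names, the toy embeddings, the temperature of 0.07, and the frame-averaging rule are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def fake_probability(frame_emb, real_text_emb, fake_text_emb, temperature=0.07):
    """Softmax over the similarities to the 'real' and 'fake' prompt embeddings.

    The temperature value is an illustrative assumption, not taken from the paper.
    """
    s_real = cosine(frame_emb, real_text_emb) / temperature
    s_fake = cosine(frame_emb, fake_text_emb) / temperature
    m = max(s_real, s_fake)  # subtract the max for numerical stability
    e_real = math.exp(s_real - m)
    e_fake = math.exp(s_fake - m)
    return e_fake / (e_real + e_fake)


def video_score(frame_embs, real_text_emb, fake_text_emb):
    """Average the per-frame fake probabilities (a hypothetical aggregation rule)."""
    probs = [fake_probability(f, real_text_emb, fake_text_emb) for f in frame_embs]
    return sum(probs) / len(probs)


# Toy 2-D embeddings standing in for encoder outputs (hypothetical values).
real_prompt = [1.0, 0.0]
fake_prompt = [0.0, 1.0]
frames = [[0.2, 0.9], [0.1, 1.0], [0.3, 0.8]]
score = video_score(frames, real_prompt, fake_prompt)
```

In practice the embeddings would come from the image and text encoders of a pre-trained VLM; the toy vectors here only exercise the scoring arithmetic.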

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Multimodal Machine Learning Applications