Detecting Deepfakes with Multivariate Soft Blending and CLIP-based Image-Text Alignment
Jingwei Li, Jiaxin Tong, Pengfei Wu

TL;DR
This paper introduces MSBA-CLIP, a novel deepfake detection framework that uses multimodal alignment, data augmentation, and forgery intensity estimation to improve accuracy and robustness across diverse datasets.
Contribution
The paper proposes a new framework combining multivariate soft blending augmentation and CLIP-guided forgery intensity estimation for enhanced deepfake detection.
Findings
Reports state-of-the-art in-domain results, with improvements in both accuracy and AUC.
Demonstrates strong cross-domain generalization across five datasets.
Validates effectiveness of proposed components through ablation studies.
Abstract
The proliferation of highly realistic facial forgeries necessitates robust detection methods. However, existing approaches often suffer from limited accuracy and poor generalization due to significant distribution shifts among samples generated by diverse forgery techniques. To address these challenges, we propose a novel Multivariate and Soft Blending Augmentation with CLIP-guided Forgery Intensity Estimation (MSBA-CLIP) framework. Our method leverages the multimodal alignment capabilities of CLIP to capture subtle forgery traces. We introduce a Multivariate and Soft Blending Augmentation (MSBA) strategy that synthesizes images by blending forgeries from multiple methods with random weights, forcing the model to learn generalizable patterns. Furthermore, a dedicated Multivariate Forgery Intensity Estimation (MFIE) module is designed to explicitly guide the model in learning features…
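The blending step described in the abstract can be illustrated with a minimal sketch. It assumes the forgeries are already aligned images of the same face, stored as one float array, and that the random weights are drawn from a Dirichlet distribution so they form a convex combination. The function name `soft_blend`, the `alpha` parameter, and the choice of Dirichlet sampling are all assumptions made for illustration; the abstract only states that forgeries from multiple methods are blended with random weights.

```python
import numpy as np


def soft_blend(forgeries, rng, alpha=1.0):
    """Blend K aligned forged images with random convex weights.

    forgeries: array of shape (K, H, W, C) with float values in [0, 1].
    rng:       a numpy.random.Generator.
    alpha:     Dirichlet concentration (hypothetical parameter; the
               sampling scheme is an assumption, not the paper's).
    Returns the blended image (H, W, C) and the K weights used.
    """
    forgeries = np.asarray(forgeries, dtype=np.float32)
    k = forgeries.shape[0]
    # Random non-negative weights that sum to 1, one per forgery method.
    weights = rng.dirichlet(np.full(k, alpha)).astype(np.float32)
    # Weighted sum over the method axis: (K,) x (K, H, W, C) -> (H, W, C).
    blended = np.tensordot(weights, forgeries, axes=1)
    return blended, weights


# Usage with random stand-in data: 4 "forgeries" of an 8x8 RGB face crop.
rng = np.random.default_rng(0)
fakes = rng.random((4, 8, 8, 3), dtype=np.float32)
blended, weights = soft_blend(fakes, rng)
```

The weights are returned alongside the image because a forgery intensity estimator would plausibly need per-method targets during training; whether the paper uses them this way is not stated in the truncated abstract, so treat that as an assumption too.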
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection · Face recognition and analysis
