TL;DR
CAMME introduces a multi-modal cross-attention framework that significantly improves deepfake detection accuracy and robustness across unseen generative models and adversarial attacks by integrating visual, textual, and frequency features.
Contribution
The paper presents CAMME, a novel multi-modal cross-attention approach that enhances deepfake detection generalization and robustness against unseen architectures and adversarial perturbations.
Findings
Improves detection accuracy over state-of-the-art methods by 12.56% on natural scenes and 13.25% on facial deepfakes.
Maintains over 91% accuracy under natural image perturbations.
Reaches 89.01% and 96.14% accuracy against PGD and FGSM adversarial attacks, respectively.
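For readers unfamiliar with the two attacks named above, here is a minimal sketch of FGSM and PGD against a toy logistic-regression detector. It is not the paper's evaluation protocol: the detector, the weights `w` and `b`, the budget `epsilon = 8/255`, the step size, and the step count are all assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(x, y, w, b):
    """Binary cross-entropy of a toy linear detector p(fake | x) = sigmoid(w.x + b)."""
    p = sigmoid(x @ w + b)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def loss_grad_wrt_input(x, y, w, b):
    """Analytic gradient of the loss w.r.t. the input: (p - y) * w."""
    return (sigmoid(x @ w + b) - y) * w

def fgsm(x, y, w, b, epsilon):
    """FGSM: a single step of size epsilon along the sign of the input gradient."""
    x_adv = x + epsilon * np.sign(loss_grad_wrt_input(x, y, w, b))
    return np.clip(x_adv, 0.0, 1.0)  # keep pixels in the valid range

def pgd(x, y, w, b, epsilon, step_size, steps):
    """PGD: repeated small signed steps, projected back into the L-inf ball around x."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + step_size * np.sign(loss_grad_wrt_input(x_adv, y, w, b))
        x_adv = np.clip(x_adv, x - epsilon, x + epsilon)  # projection step
        x_adv = np.clip(x_adv, 0.0, 1.0)
    return x_adv

# Hypothetical toy data: an 8-"pixel" input labelled fake (y = 1).
rng = np.random.default_rng(0)
x = rng.uniform(size=8)
w = rng.normal(size=8)
b, y = 0.0, 1.0
epsilon = 8 / 255  # assumed budget; not taken from the paper

x_fgsm = fgsm(x, y, w, b, epsilon)
x_pgd = pgd(x, y, w, b, epsilon, step_size=epsilon / 4, steps=10)
print("clean loss:", bce_loss(x, y, w, b))
print("FGSM loss: ", bce_loss(x_fgsm, y, w, b))
print("PGD loss:  ", bce_loss(x_pgd, y, w, b))
```

Both attacks keep the perturbation within `epsilon` of the original input in the L-infinity norm. Accuracy "against" an attack is then the detector's accuracy measured on the perturbed inputs.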
Abstract
The proliferation of sophisticated AI-generated deepfakes poses critical challenges for digital media authentication and societal security. While existing detection methods perform well within specific generative domains, they exhibit significant performance degradation when applied to manipulations produced by unseen architectures--a fundamental limitation as generative technologies rapidly evolve. We propose CAMME (Cross-Attention Multi-Modal Embeddings), a framework that dynamically integrates visual, textual, and frequency-domain features through a multi-head cross-attention mechanism to establish robust cross-domain generalization. Extensive experiments demonstrate CAMME's superiority over state-of-the-art methods, yielding improvements of 12.56% on natural scenes and 13.25% on facial deepfakes. The framework demonstrates exceptional resilience, maintaining (over 91%) accuracy…
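To make the fusion step concrete, here is a minimal NumPy sketch of multi-head cross-attention over three modality embeddings. It is an illustration under stated assumptions, not CAMME's actual architecture. The embedding size, the number of heads, the random projection matrices, and the choice of the visual embedding as the query (with textual and frequency embeddings as keys and values) are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_cross_attention(query, context, params, num_heads):
    """Scaled dot-product cross-attention.

    query:   (n_q, d) tokens that attend
    context: (n_c, d) tokens that are attended to
    Returns the fused output (n_q, d) and attention weights (num_heads, n_q, n_c).
    """
    n_q, d = query.shape
    n_c = context.shape[0]
    d_head = d // num_heads

    q = query @ params["W_q"]
    k = context @ params["W_k"]
    v = context @ params["W_v"]

    # Split the feature dimension into heads: (num_heads, n, d_head).
    q = q.reshape(n_q, num_heads, d_head).transpose(1, 0, 2)
    k = k.reshape(n_c, num_heads, d_head).transpose(1, 0, 2)
    v = v.reshape(n_c, num_heads, d_head).transpose(1, 0, 2)

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (num_heads, n_q, n_c)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                   # (num_heads, n_q, d_head)

    # Concatenate heads and apply the output projection.
    merged = heads.transpose(1, 0, 2).reshape(n_q, d)
    return merged @ params["W_o"], weights

# Hypothetical setup: one 16-dimensional embedding per modality, 4 heads.
rng = np.random.default_rng(0)
d_model, num_heads = 16, 4
params = {
    name: rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))
    for name in ("W_q", "W_k", "W_v", "W_o")
}
visual = rng.normal(size=(1, d_model))     # stand-in for an image-encoder embedding
textual = rng.normal(size=(1, d_model))    # stand-in for a text-encoder embedding
frequency = rng.normal(size=(1, d_model))  # stand-in for frequency-domain features

# Assumed design: the visual embedding queries the other two modalities.
context = np.concatenate([textual, frequency], axis=0)
fused, attn = multi_head_cross_attention(visual, context, params, num_heads)
print(fused.shape, attn.shape)
```

Each head produces its own weighting over the textual and frequency embeddings, so the mix of modalities can vary per input. In a trained model the projection matrices would be learned, and the fused vector would feed a real/fake classifier.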
