TL;DR
CAM-VFD is a cross-attention multimodal framework that uses cross-modal contradictions for video forgery detection. It outperforms single-modality detectors and remains stable under compression, noise, blur, and adversarial attacks.
Contribution
The paper proposes a cross-attention fusion mechanism that models cross-modal contradictions for video forgery detection, and reports higher accuracy and robustness than single-modality detectors.
Findings
95.31% Top-1 accuracy on GenVidBench
93.43% accuracy and 90.63% F1-score on GenVideo
Stable performance under compression, noise, blur, and adversarial attacks
Abstract
The rapid advancement of Deepfake technologies and video manipulation tools poses a critical challenge to multimedia forensics, judicial evidence integrity, and information authenticity. Current detectors rely on single-modality signals, treating appearance, geometry, and motion independently. However, advanced generators maintain within-modality consistency while producing cross-modal contradictions, which are forensically discriminative but invisible to any single-modal detector. We propose CAM-VFD, a Cross-Attention Multimodal Video Forgery Detection framework that models cross-modal contradiction as a directional forensic signal. The framework uses a cross-attention fusion mechanism in which CLIP-based appearance representations serve as queries against VideoMAE motion features and MiDaS depth features, enabling the identification of contradictions between visual, temporal, and…
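Illustrative sketch
The fusion step described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration and not the authors' implementation: the token counts, the shared feature width `d_model`, the random projection matrices, and the final concatenation are all assumptions made for the example, and random arrays stand in for real CLIP, VideoMAE, and MiDaS features. It shows only the general pattern of appearance tokens acting as queries against motion and depth tokens.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    queries: (n_q, d) tokens that ask the question (appearance here).
    context: (n_c, d) tokens that are attended to (motion or depth here).
    Returns the attended context (n_q, d) and the attention weights (n_q, n_c).
    """
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


rng = np.random.default_rng(0)
d_model = 16  # assumed shared feature width after projecting each backbone

# Random stand-ins for backbone outputs (hypothetical shapes).
appearance = rng.normal(size=(8, d_model))  # e.g. one token per sampled frame
motion = rng.normal(size=(4, d_model))      # e.g. one token per temporal tube
depth = rng.normal(size=(8, d_model))       # e.g. one token per depth map


def make_projections():
    """Random Q/K/V projection matrices; a trained model would learn these."""
    return [rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))
            for _ in range(3)]


# Appearance queries attend separately to motion and to depth.
motion_ctx, motion_attn = cross_attention(appearance, motion, *make_projections())
depth_ctx, depth_attn = cross_attention(appearance, depth, *make_projections())

# Assumed fusion: concatenate appearance with what it retrieved from each modality.
fused = np.concatenate([appearance, motion_ctx, depth_ctx], axis=-1)
print(fused.shape)
```

In a design like this, each row of `motion_attn` and `depth_attn` is a probability distribution over the other modality's tokens, so a downstream classifier could in principle compare what appearance expects with what motion and depth report. How CAM-VFD actually scores contradictions is not specified in the truncated abstract.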
