AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection
Trevine Oorloff, Surya Koppisetti, Nicolò Bonettini, Divyaraj Solanki, Ben Colman, Yaser Yacoob, Ali Shahriyari, Gaurav Bharaj

TL;DR
This paper introduces AVFF, a two-stage cross-modal learning approach that explicitly models audio-visual correspondences to improve deepfake detection, reaching 98.6% accuracy on FakeAVCeleb and outperforming the prior state of the art by 14.9% in accuracy.
Contribution
The paper proposes a novel two-stage audio-visual feature fusion method with self-supervised representation learning and a new masking strategy for enhanced deepfake detection.
Findings
Achieved 98.6% accuracy on FakeAVCeleb dataset.
Outperformed the prior state of the art by 14.9% in accuracy.
Demonstrated high discriminative power of learned representations.
Abstract
With the rapid growth in deepfake video content, we require improved and generalizable methods to detect them. Most existing detection methods either use uni-modal cues or rely on supervised training to capture the dissonance between the audio and visual modalities. While the former disregards the audio-visual correspondences entirely, the latter predominantly focuses on discerning audio-visual cues within the training corpus, thereby potentially overlooking correspondences that can help detect unseen deepfakes. We present Audio-Visual Feature Fusion (AVFF), a two-stage cross-modal learning method that explicitly captures the correspondence between the audio and visual modalities for improved deepfake detection. The first stage pursues representation learning via self-supervision on real videos to capture the intrinsic audio-visual correspondences. To extract rich cross-modal…
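The first stage described in the abstract learns audio-visual correspondences from real videos through self-supervision, and the taxonomy for this paper lists contrastive learning as a method. As a rough illustration of that idea only, the sketch below computes a symmetric contrastive (InfoNCE-style) loss between paired audio and visual embeddings. It is not the authors' implementation: the function names, the temperature of 0.07, and the random toy embeddings are all assumptions, and the paper's exact objective and masking strategy are not specified in this summary.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit L2 norm (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy_row(logits, target):
    """Negative log-softmax of logits[target], computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]


def av_contrastive_loss(audio_embs, visual_embs, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is assumed to come
    from the same real video, so (i, i) pairs are the positives.
    The temperature value is an assumption, not taken from the paper."""
    a = [normalize(e) for e in audio_embs]
    v = [normalize(e) for e in visual_embs]
    n = len(a)
    # Cosine similarity between every audio and every visual embedding.
    sim = [
        [sum(x * y for x, y in zip(a[i], v[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Audio-to-visual: each audio clip should pick out its own video frames.
    a2v = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    # Visual-to-audio: the same check along the columns.
    v2a = sum(
        cross_entropy_row([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (a2v + v2a)


# Toy data (hypothetical): 8 "videos" with 16-dimensional embeddings.
random.seed(0)
audio = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
visual_matched = [[x + 0.01 * random.gauss(0, 1) for x in row] for row in audio]
visual_shuffled = visual_matched[::-1]  # break the audio-visual pairing

matched_loss = av_contrastive_loss(audio, visual_matched)
shuffled_loss = av_contrastive_loss(audio, visual_shuffled)
```

In this toy setup the loss is low when each audio embedding sits next to its own visual embedding and rises when the pairing is broken, which is the kind of mismatch signal a downstream deepfake classifier could exploit.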
Taxonomy
Topics: Digital Media Forensic Detection · Generative Adversarial Networks and Image Synthesis · Image and Signal Denoising Methods
Methods: Contrastive Learning
