Statistics-aware Audio-visual Deepfake Detector
Marcella Astrid, Enjie Ghorbel, Djamila Aouada

TL;DR
This paper introduces a novel audio-visual deepfake detection method that leverages statistical feature loss, waveform-based audio description, and a shallower network to improve accuracy and efficiency over existing approaches.
Contribution
It proposes a statistical feature loss, waveform audio representation, post-processing normalization, and a shallower network to enhance deepfake detection performance and reduce complexity.
Findings
Effective detection on DFDC and FakeAVCeleb datasets.
Improved discrimination with statistical feature loss.
Reduced computational complexity with a shallower network.
Abstract
In this paper, we propose an enhanced audio-visual deepfake detection method. Recent methods in audio-visual deepfake detection mostly assess the synchronization between audio and visual features. Although they have shown promising results, they are based on the maximization/minimization of isolated feature distances without considering feature statistics. Moreover, they rely on cumbersome deep learning architectures and are heavily dependent on empirically fixed hyperparameters. Herein, to overcome these limitations, we propose: (1) a statistical feature loss to enhance the discrimination capability of the model, instead of relying solely on feature distances; (2) using the waveform to describe the audio in place of frequency-based representations; (3) a post-processing normalization of the fakeness score; (4) the use of a shallower network for reducing the computational…
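The contrast the abstract draws between isolated feature distances and feature statistics can be illustrated with a small sketch. The abstract does not give the form of the proposed loss, so everything below is an assumption made for illustration: the function names, the `margin` parameter, the Euclidean distance, and the choice of mean and standard deviation as the statistics are all hypothetical and should not be read as the authors' formulation.

```python
import math
from statistics import fmean, pstdev


def feature_distance(audio_feat, visual_feat):
    """Euclidean distance between one audio and one visual feature vector."""
    return math.dist(audio_feat, visual_feat)


def isolated_distance_loss(distances, labels, margin=1.0):
    """Contrastive-style loss where every distance is penalized on its own.

    labels: 0 = real (distance should be small), 1 = fake (distance
    should exceed `margin`). Hypothetical baseline, for comparison only.
    """
    terms = [
        d ** 2 if y == 0 else max(0.0, margin - d) ** 2
        for d, y in zip(distances, labels)
    ]
    return fmean(terms)


def statistical_loss(distances, labels, margin=1.0):
    """Hypothetical loss defined on per-class statistics of the distances.

    It pulls the mean distance of real samples toward zero, pushes the
    mean distance of fake samples above `margin`, and penalizes the
    spread of each class. This is an assumed form, not the paper's.
    """
    real = [d for d, y in zip(distances, labels) if y == 0]
    fake = [d for d, y in zip(distances, labels) if y == 1]
    if not real or not fake:
        raise ValueError("batch must contain both real and fake samples")
    mean_term = fmean(real) + max(0.0, margin - fmean(fake))
    spread_term = pstdev(real) + pstdev(fake)
    return mean_term + spread_term


# Toy batch: two real and two fake audio-visual distance values.
distances = [0.1, 0.2, 1.5, 1.7]
labels = [0, 0, 1, 1]
per_sample = isolated_distance_loss(distances, labels)
per_batch = statistical_loss(distances, labels)
```

In this sketch the per-sample loss looks at each distance separately, while the statistics-based variant depends on how the real and fake groups are distributed within the batch, which is the general distinction the abstract describes.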
Taxonomy
Topics: Digital Media Forensic Detection · Speech and Audio Processing · Image and Signal Denoising Methods
