Evaluation of an Audio-Video Multimodal Deepfake Dataset using Unimodal and Multimodal Detectors
Hasam Khalid, Minha Kim, Shahroz Tariq, Simon S. Woo

TL;DR
This paper evaluates a new multimodal deepfake dataset containing both audio and video, showing that ensemble-based detectors outperform unimodal ones, while purely multimodal detectors currently perform worst of the methods tested.
Contribution
It evaluates the Audio-Video Multimodal Deepfake Detection Dataset through comprehensive baseline experiments comparing unimodal, ensemble-based, and multimodal detection methods.
Findings
Unimodal detectors perform worse than ensemble-based methods.
Purely multimodal detectors show the poorest performance among tested methods.
Ensemble methods outperform both unimodal and pure multimodal approaches.
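To make the distinction between unimodal and ensemble detection concrete, here is a minimal sketch of score-level soft voting. This summary does not say how the paper's ensembles were built, so the fusion rule, the function name soft_vote, and the parameters video_weight and threshold are all illustrative assumptions, not the authors' method.

```python
from typing import List, Sequence


def soft_vote(
    video_probs: Sequence[float],
    audio_probs: Sequence[float],
    video_weight: float = 0.5,  # hypothetical weight, not taken from the paper
    threshold: float = 0.5,     # hypothetical decision threshold
) -> List[bool]:
    """Fuse per-clip fake probabilities from two unimodal detectors.

    Each input holds one probability per clip that the clip is fake,
    as produced by a video-only and an audio-only detector.
    A clip is flagged as fake when the weighted average of the two
    probabilities reaches the threshold.
    """
    if len(video_probs) != len(audio_probs):
        raise ValueError("both detectors must score the same clips")
    if not 0.0 <= video_weight <= 1.0:
        raise ValueError("video_weight must lie in [0, 1]")

    audio_weight = 1.0 - video_weight
    fused = [
        video_weight * v + audio_weight * a
        for v, a in zip(video_probs, audio_probs)
    ]
    return [p >= threshold for p in fused]


if __name__ == "__main__":
    # Second clip: real-looking video (0.2) but a cloned-sounding voice (0.9).
    video_scores = [0.9, 0.2, 0.1]
    audio_scores = [0.8, 0.9, 0.1]
    print(soft_vote(video_scores, audio_scores))
    # A video-only detector (video_weight=1.0) ignores the audio evidence.
    print(soft_vote(video_scores, audio_scores, video_weight=1.0))
```

In this toy input, the second clip is flagged only when the audio score contributes, which is the kind of case a single-modality detector cannot catch by construction. A purely multimodal detector would instead learn from both streams jointly inside one model, which this sketch does not attempt.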
Abstract
Significant advancements made in the generation of deepfakes have caused security and privacy issues. Attackers can easily impersonate a person's identity in an image by replacing his face with the target person's face. Moreover, a new domain of cloning human voices using deep-learning technologies is also emerging. Now, an attacker can generate realistic cloned voices of humans using only a few seconds of audio of the target person. With the emerging threat of potential harm deepfakes can cause, researchers have proposed deepfake detection methods. However, they only focus on detecting a single modality, i.e., either video or audio. On the other hand, to develop a good deepfake detector that can cope with the recent advancements in deepfake generation, we need to have a detector that can detect deepfakes of multiple modalities, i.e., videos and audios. To build such a detector, we need…
