DiMoDif: Discourse Modality-information Differentiation for Audio-visual Deepfake Detection and Localization
Christos Koutlis, Symeon Papadopoulos

TL;DR
DiMoDif is a novel audio-visual deepfake detection framework that identifies subtle inconsistencies between speech and visual signals to improve detection and localization accuracy, outperforming existing methods on challenging datasets.
Contribution
The paper introduces DiMoDif, a hierarchical cross-modal fusion network with adaptive alignment and discrepancy modeling for enhanced deepfake detection and localization.
Findings
Outperforms state-of-the-art by 30.5 AUC on AV-Deepfake1M
Achieves 47.88 AP@0.75 in temporal forgery localization
Excels on multiple challenging deepfake datasets
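Temporal forgery localization is commonly scored by average precision at a temporal intersection-over-union (IoU) threshold: a predicted segment counts as a hit only if its overlap with a ground-truth segment reaches that threshold. The snippet below is a minimal sketch of the IoU computation alone, under the assumption that segments are (start, end) pairs in seconds; the function name and example values are illustrative, and the paper's exact evaluation protocol is not described in this summary.

```python
def temporal_iou(pred, truth):
    """Intersection-over-union of two (start, end) time segments."""
    # Overlap length is zero when the segments do not intersect.
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical segments, chosen only to illustrate the arithmetic.
iou = temporal_iou((2.0, 6.0), (3.0, 7.0))
hit_at_050 = iou >= 0.5
hit_at_075 = iou >= 0.75
```

Here the overlap is 3 s and the union is 5 s, so the IoU is 0.6: the prediction would count as correct at a 0.5 threshold but not at 0.75, which is why stricter thresholds reward tighter boundaries.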
Abstract
Deepfake technology has rapidly advanced and poses significant threats to information integrity and trust in online multimedia. While significant progress has been made in detecting deepfakes, the simultaneous manipulation of audio and visual modalities, sometimes at small parts or in subtle ways, presents highly challenging detection scenarios. To address these challenges, we present DiMoDif, an audio-visual deepfake detection framework that leverages the inter-modality differences in machine perception of speech, based on the assumption that in real samples -- in contrast to deepfakes -- visual and audio signals coincide in terms of information. DiMoDif leverages features from deep networks that specialize in visual and audio speech recognition to spot frame-level cross-modal incongruities, and in that way to temporally localize the deepfake forgery. To this end, we devise a…
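The core assumption in the abstract, that audio and visual signals carry the same information in real samples, can be illustrated with a toy example. The sketch below is not the authors' architecture: it assumes two already-aligned per-frame feature arrays of shape (T, D), scores each frame by cosine distance, and groups frames above a hand-picked threshold into segments. The function names, the random features, and the threshold are all hypothetical.

```python
import numpy as np


def frame_discrepancy(visual, audio):
    """Per-frame cosine distance between two (T, D) feature arrays."""
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    return 1.0 - np.sum(v * a, axis=1)


def flagged_segments(scores, threshold):
    """Group consecutive frames scoring above threshold into [start, end) pairs."""
    segments = []
    start = None
    for i, score in enumerate(scores):
        if score > threshold and start is None:
            start = i
        elif score <= threshold and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(scores)))
    return segments


# Synthetic stand-ins for speech-recognition features (assumption, not real data).
rng = np.random.default_rng(0)
visual = rng.normal(size=(20, 8))
audio = visual.copy()
audio[8:12] = -visual[8:12]  # simulate a short cross-modal mismatch

scores = frame_discrepancy(visual, audio)
segments = flagged_segments(scores, threshold=0.5)
```

With these synthetic inputs, only frames 8 to 11 disagree, so `segments` contains the single span `(8, 12)`. A learned model would replace both the raw cosine distance and the fixed threshold.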
Taxonomy
Topics: Digital Media Forensic Detection · Speech and Audio Processing · Music and Audio Processing
Methods: Attention Is All You Need · Adam · Residual Connection · Byte Pair Encoding · Linear Layer · Absolute Position Encodings · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Multi-Head Attention
