DiMoDif: Discourse Modality-information Differentiation for Audio-visual Deepfake Detection and Localization

Christos Koutlis; Symeon Papadopoulos

arXiv:2411.10193 · cs.CV · April 14, 2025

TL;DR

DiMoDif is a novel audio-visual deepfake detection framework that identifies subtle inconsistencies between speech and visual signals to improve detection and localization accuracy, outperforming existing methods on challenging datasets.

Contribution

The paper introduces DiMoDif, a hierarchical cross-modal fusion network with adaptive alignment and discrepancy modeling for enhanced deepfake detection and localization.

Findings

01

Outperforms state-of-the-art by 30.5 AUC on AV-Deepfake1M

02

Achieves 47.88 AP@0.75 in temporal forgery localization

03

Excels on multiple challenging deepfake datasets

Abstract

Deepfake technology has rapidly advanced and poses significant threats to information integrity and trust in online multimedia. While significant progress has been made in detecting deepfakes, the simultaneous manipulation of audio and visual modalities, sometimes at small parts or in subtle ways, presents highly challenging detection scenarios. To address these challenges, we present DiMoDif, an audio-visual deepfake detection framework that leverages the inter-modality differences in machine perception of speech, based on the assumption that in real samples -- in contrast to deepfakes -- visual and audio signals coincide in terms of information. DiMoDif leverages features from deep networks that specialize in visual and audio speech recognition to spot frame-level cross-modal incongruities, and in that way to temporally localize the deepfake forgery. To this end, we devise a…
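
The frame-level idea in the abstract can be illustrated with a toy sketch. It is not the authors' architecture: it assumes that visual and audio speech features have already been extracted, projected to the same dimension, and aligned to the same frame rate, and it swaps in plain cosine distance plus a fixed threshold where the paper uses a learned model. The function names `frame_discrepancy` and `localize_segments` and the threshold value are hypothetical.

```python
import numpy as np


def frame_discrepancy(visual_feats, audio_feats):
    """Per-frame cosine distance between two time-aligned (T, D) feature arrays.

    Assumption: both modalities share the same number of frames T and
    feature dimension D. Real pipelines need an alignment step first.
    """
    v = visual_feats / (np.linalg.norm(visual_feats, axis=1, keepdims=True) + 1e-8)
    a = audio_feats / (np.linalg.norm(audio_feats, axis=1, keepdims=True) + 1e-8)
    return 1.0 - np.sum(v * a, axis=1)


def localize_segments(scores, threshold=0.5):
    """Group consecutive frames whose score exceeds the threshold.

    Returns a list of (start, end) frame indices, end exclusive.
    """
    segments = []
    start = None
    for i, flagged in enumerate(scores > threshold):
        if flagged and start is None:
            start = i                      # a suspicious run begins
        elif not flagged and start is not None:
            segments.append((start, i))    # the run ended at frame i
            start = None
    if start is not None:                  # run extends to the last frame
        segments.append((start, len(scores)))
    return segments


# Toy data: the modalities agree everywhere except frames 4 to 6.
rng = np.random.default_rng(0)
visual = rng.normal(size=(10, 16))
audio = visual.copy()
audio[4:7] = -visual[4:7]                  # simulated cross-modal mismatch

scores = frame_discrepancy(visual, audio)
print(localize_segments(scores))           # [(4, 7)]
```

In this sketch a video-level detection score could be taken as, for example, the maximum frame score, while the returned segments play the role of the temporal localization output.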

Taxonomy

Topics: Digital Media Forensic Detection · Speech and Audio Processing · Music and Audio Processing

Methods: Attention Is All You Need · Adam · Residual Connection · Byte Pair Encoding · Linear Layer · Absolute Position Encodings · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Multi-Head Attention