AuViRe: Audio-visual Speech Representation Reconstruction for Deepfake Temporal Localization
Christos Koutlis, Symeon Papadopoulos

TL;DR
AuViRe localizes deepfake segments in time by reconstructing the speech representation of one modality from the other and exploiting the larger reconstruction discrepancies found in manipulated segments. It outperforms the state of the art on LAV-DF, AV-Deepfake1M, and an in-the-wild experiment.
Contribution
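The paper presents a cross-modal reconstruction approach to temporal localization of deepfakes: speech representations of one modality (e.g., lip movements) are reconstructed from the other (e.g., audio waveform), and the resulting discrepancies serve as cues for locating forged segments more accurately than prior methods.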
Findings
AuViRe achieves +8.9 [email protected] on LAV-DF
AuViRe achieves +9.6 [email protected] on AV-Deepfake1M
AuViRe achieves +5.1 AUC on in-the-wild data
Abstract
With the rapid advancement of sophisticated synthetic audio-visual content, e.g., for subtle malicious manipulations, ensuring the integrity of digital media has become paramount. This work presents a novel approach to temporal localization of deepfakes by leveraging Audio-Visual Speech Representation Reconstruction (AuViRe). Specifically, our approach reconstructs speech representations from one modality (e.g., lip movements) based on the other (e.g., audio waveform). Cross-modal reconstruction is significantly more challenging in manipulated video segments, leading to amplified discrepancies, thereby providing robust discriminative cues for precise temporal forgery localization. AuViRe outperforms the state of the art by +8.9 [email protected] on LAV-DF, +9.6 [email protected] on AV-Deepfake1M, and +5.1 AUC on an in-the-wild experiment. Code available at https://github.com/mever-team/auvire.
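The core idea in the abstract, that reconstruction discrepancies are larger in manipulated segments, can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that). It assumes the target-modality features and their cross-modal reconstructions are already available as per-frame vectors, uses cosine distance as the discrepancy measure, and replaces any learned localization stage with a fixed threshold. The function names, the threshold, and the toy data are all hypothetical.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # treat a zero vector as maximally dissimilar
    return 1.0 - dot / (norm_u * norm_v)

def discrepancy_scores(target_feats, reconstructed_feats):
    """Per-frame discrepancy between one modality's features and their
    reconstruction from the other modality (assumed precomputed)."""
    return [cosine_distance(t, r)
            for t, r in zip(target_feats, reconstructed_feats)]

def localize_segments(scores, threshold, fps):
    """Group consecutive frames whose score exceeds `threshold` into
    (start_seconds, end_seconds) segments. Thresholding is an
    illustrative simplification, not the paper's localization method."""
    segments = []
    start = None
    for i, score in enumerate(scores):
        if score > threshold and start is None:
            start = i
        elif score <= threshold and start is not None:
            segments.append((start / fps, i / fps))
            start = None
    if start is not None:
        segments.append((start / fps, len(scores) / fps))
    return segments

# Toy data: 8 frames of 2-D "visual" features; frames 3 and 4 disagree
# with what was reconstructed from audio, mimicking a manipulated span.
visual = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 2 + [[1.0, 0.0]] * 3
reconstructed_from_audio = [[1.0, 0.0]] * 8

scores = discrepancy_scores(visual, reconstructed_from_audio)
segments = localize_segments(scores, threshold=0.5, fps=1.0)
print(segments)  # [(3.0, 5.0)]
```

In this toy setting the frames where the reconstruction fails stand out as a single contiguous segment. A real system would derive the features from pretrained speech encoders and would likely learn the mapping from discrepancies to segment boundaries rather than use a hand-picked threshold.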
Taxonomy
Topics: Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection
