AuViRe: Audio-visual Speech Representation Reconstruction for Deepfake Temporal Localization

Christos Koutlis; Symeon Papadopoulos

arXiv:2511.18993 · cs.CV · November 25, 2025

TL;DR

AuViRe localizes deepfake segments in time by reconstructing speech representations across the audio and visual modalities and using the resulting discrepancies as cues, outperforming the state of the art on LAV-DF, AV-Deepfake1M, and an in-the-wild experiment.

Contribution

The paper presents a new cross-modal reconstruction approach for deepfake detection that improves temporal localization accuracy over prior methods.

Findings

01

AuViRe improves on the state of the art by +8.9 AP@0.95 on LAV-DF

02

AuViRe improves on the state of the art by +9.6 AP@0.5 on AV-Deepfake1M

03

AuViRe improves on the state of the art by +5.1 AUC in an in-the-wild experiment
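
Temporal localization benchmarks such as these commonly count a predicted segment as correct when its temporal intersection-over-union (IoU) with a ground-truth segment meets a threshold. The following is a minimal sketch of that overlap measure, assuming segments are (start, end) pairs in seconds. The function name is illustrative and is not taken from the paper's code.

```python
def temporal_iou(pred, truth):
    """Intersection-over-union of two (start, end) time segments."""
    # Overlap is zero when the segments do not intersect.
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


# A prediction that misses the first 0.2 s of a 10 s fake segment.
iou = temporal_iou((0.2, 10.0), (0.0, 10.0))
print(f"temporal IoU = {iou:.2f}")
```

A stricter threshold demands tighter segment boundaries, so scores at high thresholds reward precise localization rather than rough detection.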

Abstract

With the rapid advancement of sophisticated synthetic audio-visual content, e.g., for subtle malicious manipulations, ensuring the integrity of digital media has become paramount. This work presents a novel approach to temporal localization of deepfakes by leveraging Audio-Visual Speech Representation Reconstruction (AuViRe). Specifically, our approach reconstructs speech representations from one modality (e.g., lip movements) based on the other (e.g., audio waveform). Cross-modal reconstruction is significantly more challenging in manipulated video segments, leading to amplified discrepancies, thereby providing robust discriminative cues for precise temporal forgery localization. AuViRe outperforms the state of the art by +8.9 AP@0.95 on LAV-DF, +9.6 AP@0.5 on AV-Deepfake1M, and +5.1 AUC on an in-the-wild experiment. Code available at https://github.com/mever-team/auvire.
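
The reconstruction idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (the repository linked above holds that). It assumes per-frame audio and visual feature vectors are already available, swaps the learned reconstruction networks for a ridge-regression linear map, and uses synthetic data. All function names, the feature dimension, and the threshold rule are hypothetical choices made for illustration.

```python
import numpy as np


def fit_linear_reconstructor(src, dst, ridge=1e-3):
    """Ridge least-squares map W with src @ W ~= dst (stand-in for a learned model)."""
    d = src.shape[1]
    return np.linalg.solve(src.T @ src + ridge * np.eye(d), src.T @ dst)


def discrepancy_scores(audio, visual, w_a2v, w_v2a):
    """Per-frame reconstruction error, summed over both directions."""
    err_visual = np.linalg.norm(audio @ w_a2v - visual, axis=1)
    err_audio = np.linalg.norm(visual @ w_v2a - audio, axis=1)
    return err_visual + err_audio


def segments_above(scores, threshold):
    """Return [start, end) frame index pairs where scores exceed threshold."""
    mask = scores > threshold
    padded = np.concatenate(([False], mask, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    return [(int(s), int(e)) for s, e in zip(changes[::2], changes[1::2])]


# Synthetic data (assumption): visual features are a noisy linear mix of audio.
rng = np.random.default_rng(0)
dim = 8
mix = np.eye(dim) + 0.1 * rng.normal(size=(dim, dim))


def make_clip(num_frames):
    audio = rng.normal(size=(num_frames, dim))
    visual = audio @ mix + 0.05 * rng.normal(size=(num_frames, dim))
    return audio, visual


# Fit both reconstruction directions on an unmanipulated clip.
train_audio, train_visual = make_clip(500)
w_a2v = fit_linear_reconstructor(train_audio, train_visual)
w_v2a = fit_linear_reconstructor(train_visual, train_audio)

# Simulate a manipulation: frames 80..119 get visual features unrelated to the audio.
test_audio, test_visual = make_clip(200)
test_visual[80:120] = rng.normal(size=(40, dim))

scores = discrepancy_scores(test_audio, test_visual, w_a2v, w_v2a)
# Hypothetical threshold rule: twice the largest error seen on real training frames.
threshold = 2.0 * discrepancy_scores(train_audio, train_visual, w_a2v, w_v2a).max()
print(segments_above(scores, threshold))
```

In this toy setting the reconstruction error rises sharply wherever the two modalities stop agreeing, which is the kind of cue the abstract describes. A real system would replace the linear maps with learned models over speech representations and the fixed threshold with a trained localization head.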

Taxonomy

Topics: Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection