SAVe: Self-Supervised Audio-visual Deepfake Detection Exploiting Visual Artifacts and Audio-visual Misalignment

Sahibzada Adil Shahzad; Ammarah Hashmi; Junichi Yamagishi; Yusuke Yasuda; Yu Tsao; Chia-Wen Lin; Yan-Tsung Peng; Hsin-Min Wang

arXiv:2603.25140 · cs.CV · March 27, 2026

TL;DR

SAVe introduces a self-supervised framework for audio-visual deepfake detection that learns from authentic videos, using pseudo-manipulations and lip-speech synchronization to identify subtle artifacts and inconsistencies.

Contribution

The paper presents a novel self-supervised approach that does not rely on synthetic training data, using on-the-fly pseudo-manipulations and cross-modal alignment for robust deepfake detection.
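To make the idea of an on-the-fly pseudo-manipulation concrete, here is a minimal sketch of self-blending on a toy grayscale image. It is an illustration under stated assumptions, not the authors' implementation: the function names `self_blend` and `make_training_pair`, the brightness `gain` transform, the rectangular `region` format, and the `alpha` blend weight are all hypothetical choices made for this example.

```python
import random


def self_blend(image, region, alpha=0.5, gain=1.2):
    """Blend a brightness-scaled copy of `image` back into itself inside `region`.

    image  : 2D list of floats in [0, 1] (a toy grayscale face crop)
    region : (top, left, bottom, right), half-open box standing in for a
             facial region such as the mouth or eyes (hypothetical format)
    alpha  : blend weight of the transformed copy (assumed value)
    gain   : brightness factor applied to the copy (assumed transform)
    """
    top, left, bottom, right = region
    out = [row[:] for row in image]  # copy so the authentic frame is untouched
    for y in range(top, bottom):
        for x in range(left, right):
            transformed = min(1.0, image[y][x] * gain)
            out[y][x] = alpha * transformed + (1.0 - alpha) * image[y][x]
    return out


def make_training_pair(image, regions, rng):
    """Return (authentic, label 0) and (pseudo-fake, label 1) from one real frame."""
    region = rng.choice(regions)
    return (image, 0), (self_blend(image, region), 1)


if __name__ == "__main__":
    frame = [[0.5] * 4 for _ in range(4)]
    regions = [(1, 1, 3, 3), (0, 0, 2, 4)]  # hypothetical region boxes
    real, fake = make_training_pair(frame, regions, random.Random(0))
    print(real[1], fake[1])
```

Because source and target come from the same frame, identity is preserved and only a blending boundary and a slight statistical mismatch are introduced, which is the kind of cue a detector can learn from without any synthetic forgeries.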

Findings

01. Achieves competitive in-domain detection performance.

02. Demonstrates strong cross-dataset generalization.

03. Effective in identifying subtle visual artifacts and audio-visual misalignments.

Abstract

Multimodal deepfakes can exhibit subtle visual artifacts and cross-modal inconsistencies, which remain challenging to detect, especially when detectors are trained primarily on curated synthetic forgeries. Such synthetic dependence can introduce dataset and generator bias, limiting scalability and robustness to unseen manipulations. We propose SAVe, a self-supervised audio-visual deepfake detection framework that learns entirely on authentic videos. SAVe generates on-the-fly, identity-preserving, region-aware self-blended pseudo-manipulations to emulate tampering artifacts, enabling the model to learn complementary visual cues across multiple facial granularities. To capture cross-modal evidence, SAVe also models lip-speech synchronization via an audio-visual alignment component that detects temporal misalignment patterns characteristic of audio-visual forgeries. Experiments on…
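The lip-speech synchronization idea in the abstract can be illustrated with a small sketch that estimates the temporal offset between a visual signal and an audio signal. This is a simplified stand-in, not the paper's alignment component: scalar per-frame values replace learned embeddings, and the names `shift`, `lag_score`, and `best_lag` as well as the `max_lag` search window are assumptions introduced here.

```python
def shift(seq, k):
    """Delay `seq` by k >= 0 steps, padding the start with its first value."""
    if k == 0:
        return list(seq)
    return [seq[0]] * k + list(seq[:-k])


def lag_score(visual, audio, lag):
    """Mean product of mean-centred visual[t] and audio[t + lag] over the overlap."""
    n = len(visual)
    v_mean = sum(visual) / n
    a_mean = sum(audio) / n
    pairs = [
        (visual[t] - v_mean, audio[t + lag] - a_mean)
        for t in range(n)
        if 0 <= t + lag < n
    ]
    return sum(v * a for v, a in pairs) / len(pairs)


def best_lag(visual, audio, max_lag):
    """Return the lag in [-max_lag, max_lag] with the highest lag_score."""
    return max(range(-max_lag, max_lag + 1),
               key=lambda lag: lag_score(visual, audio, lag))


if __name__ == "__main__":
    lips = [0, 0, 1, 0, 0, 0, 0, 0]   # toy mouth-opening signal per frame
    speech_synced = list(lips)        # authentic: audio matches the lips
    speech_shifted = shift(lips, 2)   # pseudo-forgery: audio delayed by 2 frames
    print(best_lag(lips, speech_synced, 3))
    print(best_lag(lips, speech_shifted, 3))
```

An authentic clip yields an estimated lag near zero, while a clip with deliberately shifted audio yields a non-zero lag, so shifting the audio of a real video is one plausible way to produce misaligned training examples from authentic data alone.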

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection · Speech and Audio Processing