X-AVDT: Audio-Visual Cross-Attention for Robust Deepfake Detection
Youngseo Kim, Kwan Yun, Seokhyeon Hong, Sihun Cha, Colette Suhjung Koo, Junyong Noh

TL;DR
This paper introduces X-AVDT, a deepfake detection method that exploits internal cross-attention cues from generative models, achieving robust and generalizable detection across diverse synthesis techniques and unseen generators.
Contribution
The paper presents X-AVDT, a detector that uses generator-internal audio-visual signals, and introduces MMDF, a comprehensive dataset for evaluating deepfake detection methods.
Findings
X-AVDT outperforms existing methods with 13.1% higher accuracy.
It generalizes well to unseen generators and external benchmarks.
Internal cross-attention cues are effective for robust deepfake detection.
Abstract
The surge of highly realistic synthetic videos produced by contemporary generative systems has significantly increased the risk of malicious use, challenging both humans and existing detectors. Against this backdrop, we take a generator-side view and observe that internal cross-attention mechanisms in these models encode fine-grained speech-motion alignment, offering useful correspondence cues for forgery detection. Building on this insight, we propose X-AVDT, a robust and generalizable deepfake detector that probes generator-internal audio-visual signals accessed via DDIM inversion to expose these cues. X-AVDT extracts two complementary signals: (i) a video composite capturing inversion-induced discrepancies, and (ii) an audio-visual cross-attention feature reflecting modality alignment enforced during generation. To enable faithful cross-generator evaluation, we further introduce…
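The two signals named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a hypothetical noise predictor (`toy_eps_model`), a hand-picked cumulative-alpha schedule, and random arrays standing in for video and audio features. It also assumes that "inversion-induced discrepancies" can be approximated by the residual between an input and its reconstruction after deterministic DDIM inversion, which is one plausible reading of the abstract rather than the paper's stated definition.

```python
import numpy as np

def ddim_step(x, eps, a_from, a_to):
    """Deterministic DDIM update (eta = 0) between two cumulative-alpha levels."""
    # Predict the clean sample implied by the current sample and noise estimate.
    x0_pred = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    # Re-noise (or de-noise) that prediction to the target level.
    return np.sqrt(a_to) * x0_pred + np.sqrt(1.0 - a_to) * eps

def invert_and_reconstruct(x, eps_model, alphas):
    """DDIM inversion (clean -> noisy), then deterministic sampling back."""
    z = x
    for i in range(len(alphas) - 1):
        z = ddim_step(z, eps_model(z, i), alphas[i], alphas[i + 1])
    latent = z
    for i in range(len(alphas) - 1, 0, -1):
        z = ddim_step(z, eps_model(z, i), alphas[i], alphas[i - 1])
    return latent, z

def cross_attention_map(video_q, audio_k):
    """Row-wise softmax of scaled dot products: video queries over audio keys."""
    d = video_q.shape[-1]
    scores = video_q @ audio_k.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def toy_eps_model(x, step):
    """Hypothetical stand-in for a trained noise predictor (assumption)."""
    return 0.1 * np.tanh(x)

rng = np.random.default_rng(0)
frame = rng.standard_normal((8, 8))       # stand-in for one video frame or latent
alphas = np.linspace(0.999, 0.05, 20)     # assumed schedule, clean to noisy

latent, recon = invert_and_reconstruct(frame, toy_eps_model, alphas)
residual = np.abs(frame - recon)          # assumed proxy for the discrepancy signal

video_q = rng.standard_normal((4, 16))    # 4 hypothetical video tokens
audio_k = rng.standard_normal((6, 16))    # 6 hypothetical audio tokens
attn = cross_attention_map(video_q, audio_k)
```

In this sketch, `residual` plays the role of the discrepancy signal and `attn` the role of the audio-visual alignment feature. A real detector would read both from a pretrained audio-conditioned generator rather than from toy functions.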
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection · Face recognition and analysis
