X-AVDT: Audio-Visual Cross-Attention for Robust Deepfake Detection

Youngseo Kim; Kwan Yun; Seokhyeon Hong; Sihun Cha; Colette Suhjung Koo; Junyong Noh

arXiv:2603.08483·cs.CV·March 11, 2026

TL;DR

This paper introduces X-AVDT, a deepfake detection method that exploits internal cross-attention cues from generative models, achieving robust and generalizable detection across diverse synthesis techniques and unseen generators.

Contribution

The paper presents X-AVDT, a novel detector that uses generator-internal audio-visual signals, and introduces MMDF, a comprehensive dataset for evaluating deepfake detection methods.

Findings

01. X-AVDT outperforms existing methods with 13.1% higher accuracy.
02. It generalizes well to unseen generators and external benchmarks.
03. Internal cross-attention cues are effective for robust deepfake detection.

Abstract

The surge of highly realistic synthetic videos produced by contemporary generative systems has significantly increased the risk of malicious use, challenging both humans and existing detectors. Against this backdrop, we take a generator-side view and observe that internal cross-attention mechanisms in these models encode fine-grained speech-motion alignment, offering useful correspondence cues for forgery detection. Building on this insight, we propose X-AVDT, a robust and generalizable deepfake detector that probes generator-internal audio-visual signals accessed via DDIM inversion to expose these cues. X-AVDT extracts two complementary signals: (i) a video composite capturing inversion-induced discrepancies, and (ii) an audio-visual cross-attention feature reflecting modality alignment enforced during generation. To enable faithful cross-generator evaluation, we further introduce…
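The two signals named in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes a deterministic DDIM update (eta = 0), a hypothetical stand-in noise predictor `toy_eps_model`, a linear beta schedule, and random arrays in place of real video frames and learned query/key projections. All function names are chosen here for illustration only. The first part inverts a sample toward noise, reconstructs it, and takes the absolute difference as an inversion-induced discrepancy; the second part computes scaled dot-product attention weights with video tokens as queries and audio tokens as keys.

```python
import numpy as np


def ddim_step(x, eps, a_from, a_to):
    """Deterministic DDIM update (eta = 0) between two cumulative-alpha levels."""
    x0_pred = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_pred + np.sqrt(1.0 - a_to) * eps


def ddim_invert(x0, eps_model, abar):
    """Walk a clean sample toward noise, one schedule level at a time."""
    x = x0
    for i in range(len(abar) - 1):
        x = ddim_step(x, eps_model(x, i), abar[i], abar[i + 1])
    return x


def ddim_reconstruct(x_T, eps_model, abar):
    """Walk an inverted latent back to the clean level."""
    x = x_T
    for i in range(len(abar) - 1, 0, -1):
        x = ddim_step(x, eps_model(x, i), abar[i], abar[i - 1])
    return x


def cross_attention_map(video_tokens, audio_tokens):
    """Softmax attention weights: video queries attending over audio keys."""
    d = video_tokens.shape[-1]
    scores = video_tokens @ audio_tokens.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


def toy_eps_model(x, step):
    """Hypothetical noise predictor; a real generator would be a trained network."""
    return 0.1 * np.tanh(x)


# Assumed linear beta schedule; abar[0] = 1.0 is the clean (noise-free) level.
betas = np.linspace(1e-4, 0.02, 50)
abar = np.concatenate(([1.0], np.cumprod(1.0 - betas)))

rng = np.random.default_rng(0)
frame = rng.standard_normal((8, 8))          # stand-in for a video frame or latent
latent = ddim_invert(frame, toy_eps_model, abar)
recon = ddim_reconstruct(latent, toy_eps_model, abar)
residual = np.abs(frame - recon)             # inversion-induced discrepancy

video_tokens = rng.standard_normal((6, 16))  # stand-in for projected video queries
audio_tokens = rng.standard_normal((4, 16))  # stand-in for projected audio keys
attn = cross_attention_map(video_tokens, audio_tokens)

print("mean residual:", residual.mean())
print("attention map shape:", attn.shape)
```

In this sketch the round trip is inexact because the noise estimate is evaluated at different points on the way up and on the way down, which is what leaves a nonzero residual. How X-AVDT builds its video composite from such discrepancies, and which attention layers it reads, is not specified in the text above.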

Code & Models

Datasets

zaqxsw0526/MMDF
dataset · 213 dl

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection · Face recognition and analysis