AVT2-DWF: Improving Deepfake Detection with Audio-Visual Fusion and Dynamic Weighting Strategies
Rui Wang, Dengpan Ye, Long Tang, Yunming Zhang, Jiacheng Deng

TL;DR
AVT2-DWF is a dual-transformer framework with dynamic weight fusion that detects deepfakes from both audio and visual cues, reporting state-of-the-art results on the DeepfakeTIMIT, FakeAVCeleb, and DFDC datasets.
Contribution
The paper presents a dual-stage audio-visual transformer with dynamic weight fusion for deepfake detection, targeting the difficulty of fusing heterogeneous audio and visual information.
Findings
Achieves state-of-the-art performance on DeepfakeTIMIT, FakeAVCeleb, and DFDC datasets.
Effectively captures spatial and temporal features of facial expressions.
Enhances intra- and cross-dataset deepfake detection capabilities.
Abstract
With the continuous improvements of deepfake methods, forgery messages have transitioned from single-modality to multi-modal fusion, posing new challenges for existing forgery detection algorithms. In this paper, we propose AVT2-DWF, the Audio-Visual dual Transformers grounded in Dynamic Weight Fusion, which aims to amplify both intra- and cross-modal forgery cues, thereby enhancing detection capabilities. AVT2-DWF adopts a dual-stage approach to capture both spatial characteristics and temporal dynamics of facial expressions. This is achieved through a face transformer with an n-frame-wise tokenization strategy encoder and an audio transformer encoder. Subsequently, it uses multi-modal conversion with dynamic weight fusion to address the challenge of heterogeneous information fusion between audio and visual modalities. Experiments on DeepfakeTIMIT, FakeAVCeleb, and DFDC datasets…
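The dynamic weight fusion step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the function name `dynamic_weight_fusion`, the use of a softmax over two modality logits, and the weighted sum of equal-length feature vectors are hypothetical and may differ from the authors' method. In a trained model the logits would be learned or predicted from the inputs; here they are passed in by hand.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_weight_fusion(visual_feat, audio_feat, modality_logits):
    """Fuse two equal-length feature vectors with per-modality weights.

    Assumption (not from the paper): the weights come from a softmax over
    two scalar logits, one per modality, so they are positive and sum to 1.
    """
    if len(visual_feat) != len(audio_feat):
        raise ValueError("visual and audio features must have the same length")
    if len(modality_logits) != 2:
        raise ValueError("expected one logit per modality (visual, audio)")

    w_visual, w_audio = softmax(modality_logits)
    # Weighted sum, element by element, of the two modality embeddings.
    fused = [w_visual * v + w_audio * a for v, a in zip(visual_feat, audio_feat)]
    return fused, (w_visual, w_audio)


if __name__ == "__main__":
    visual = [1.0, 0.0]   # hypothetical output of the face transformer
    audio = [0.0, 1.0]    # hypothetical output of the audio transformer
    fused, weights = dynamic_weight_fusion(visual, audio, [0.0, 0.0])
    print(fused, weights)
```

With equal logits both modalities contribute equally; raising the visual logit shifts the fused vector toward the visual embedding, which is the behaviour a learned weighting would exploit when one modality carries the stronger forgery cue.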
Taxonomy
Topics: Digital Media Forensic Detection
