ERF-BA-TFD+: A Multimodal Model for Audio-Visual Deepfake Detection
Xin Zhang, Jiaming Chu, Jian Zhao, Yuchu Jiang, Xu Yang, Lei Jin, Chi Zhang, Xuelong Li

TL;DR
ERF-BA-TFD+ is a novel multimodal deepfake detection model that combines enhanced receptive fields and audio-visual fusion to improve accuracy and robustness in detecting manipulated multimedia content across both audio and video modalities.
Contribution
The paper introduces ERF-BA-TFD+, a model that captures long-range dependencies in audio-visual data for deepfake detection and achieves state-of-the-art results on the DDL-AV dataset.
Findings
Achieved state-of-the-art accuracy on the DDL-AV dataset.
Outperformed existing methods in detection speed.
Won first place in the DDL-AV competition.
Abstract
Deepfake detection is a critical task in identifying manipulated multimedia content. In real-world scenarios, deepfake content can manifest across multiple modalities, including audio and video. To address this challenge, we present ERF-BA-TFD+, a novel multimodal deepfake detection model that combines an enhanced receptive field (ERF) with audio-visual fusion. Our model processes audio and video features simultaneously, leveraging their complementary information to improve detection accuracy and robustness. The key innovation of ERF-BA-TFD+ lies in its ability to model long-range dependencies within the audio-visual input, allowing it to better capture subtle discrepancies between real and fake content. In our experiments, we evaluate ERF-BA-TFD+ on the DDL-AV dataset, which consists of both segmented and full-length video clips. Unlike previous benchmarks, which focused primarily on…
