AVTENet: A Human-Cognition-Inspired Audio-Visual Transformer-Based Ensemble Network for Video Deepfake Detection

Ammarah Hashmi; Sahibzada Adil Shahzad; Chia-Wen Lin; Yu Tsao; Hsin-Min Wang

arXiv:2310.13103 · cs.CV · July 8, 2025 · 5 cites

TL;DR

This paper introduces AVTENet, a transformer-based ensemble network inspired by human multisensory perception, which effectively detects deepfake videos by integrating audio and visual cues, outperforming existing methods and humans.

Contribution

The study proposes a novel multimodal transformer-based framework, AVTENet, for deepfake detection that comprehensively evaluates audio-visual manipulations on benchmark datasets.

Findings

01. AVTENet achieves state-of-the-art detection accuracy.

02. It outperforms existing methods on the FakeAVCeleb dataset.

03. AVTENet surpasses human performance in deepfake detection.

Abstract

The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries. Most previous studies on detecting artificial intelligence-generated fake videos use only the visual modality or only the audio modality. While some methods exploit both audio and visual modalities to detect forged videos, they have not been comprehensively evaluated on multimodal datasets of deepfake videos involving acoustic and visual manipulations, and most are based on convolutional neural networks with low detection accuracy. Considering that human cognition instinctively integrates multisensory information, including audio and visual cues, to perceive and interpret content, and given the success of transformers in various fields, this study introduces the audio-visual transformer-based ensemble network (AVTENet). This innovative framework tackles the complexities of deepfake…

Taxonomy

Topics: Digital Media Forensic Detection · Generative Adversarial Networks and Image Synthesis · Advanced Image Processing Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Position-Wise Feed-Forward Layer · Dense Connections · Residual Connection · Absolute Position Encodings · Adam · Byte Pair Encoding