Lightweight Joint Audio-Visual Deepfake Detection via Single-Stream Multi-Modal Learning Framework

Kuiyuan Zhang; Wenjie Pei; Rushi Lan; Yifang Guo; Zhongyun Hua

arXiv:2506.07358·cs.SD·June 10, 2025

Open Access

TL;DR

This paper introduces a lightweight, single-stream multi-modal learning framework for audio-visual deepfake detection that efficiently integrates features and improves robustness against mismatched modalities.

Contribution

The work proposes a novel single-stream network with a collaborative learning block and multi-modal classification module, enhancing efficiency and detection performance over existing methods.

Findings

01. Achieves superior detection accuracy on multiple benchmarks.

02. Reduces model size to only 0.48 million parameters.

03. Improves robustness against modality mismatches.

Abstract

Deepfakes are AI-synthesized multimedia data that may be abused for spreading misinformation. Deepfake generation involves both visual and audio manipulation. To detect audio-visual deepfakes, previous studies commonly employ two relatively independent sub-models to learn audio and visual features, respectively, and fuse them subsequently for deepfake detection. However, this may underutilize the inherent correlations between audio and visual features. Moreover, utilizing two isolated feature learning sub-models can result in redundant neural layers, making the overall model inefficient and impractical for resource-constrained environments. In this work, we design a lightweight network for audio-visual deepfake detection via a single-stream multi-modal learning framework. Specifically, we introduce a collaborative audio-visual learning block to efficiently integrate multi-modal…
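
The redundancy argument in the abstract can be illustrated with a small parameter count. The sketch below is not the authors' architecture: the layer widths, the use of plain fully connected layers, and the helper names (`linear_params`, `stack_params`, `two_stream_params`, `single_stream_params`) are all assumptions made for illustration. It only compares two isolated feature stacks plus a fusion head against per-modality input projections feeding one shared trunk.

```python
def linear_params(d_in: int, d_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return d_in * d_out + d_out


def stack_params(dims: list[int]) -> int:
    """Total parameters of a chain of linear layers with the given widths."""
    return sum(linear_params(a, b) for a, b in zip(dims, dims[1:]))


# Hypothetical layer widths, chosen only for illustration.
AUDIO_DIMS = [64, 128, 128, 64]    # isolated audio sub-model
VISUAL_DIMS = [256, 128, 128, 64]  # isolated visual sub-model
FUSION_DIMS = [128, 64, 2]         # concatenated 64 + 64 features -> real/fake logits


def two_stream_params() -> int:
    """Two independent sub-models followed by a late-fusion classifier."""
    return (
        stack_params(AUDIO_DIMS)
        + stack_params(VISUAL_DIMS)
        + stack_params(FUSION_DIMS)
    )


def single_stream_params() -> int:
    """Per-modality input projections into one trunk shared by both modalities."""
    audio_proj = linear_params(64, 128)
    visual_proj = linear_params(256, 128)
    shared_trunk = stack_params([128, 128, 64])  # counted once, reused for both inputs
    head = stack_params([64, 2])
    return audio_proj + visual_proj + shared_trunk + head


if __name__ == "__main__":
    print("two-stream parameters:   ", two_stream_params())
    print("single-stream parameters:", single_stream_params())
```

Under these assumed widths the shared trunk is counted once instead of once per modality, which is the source of the saving. The reported 0.48 million parameters comes from the paper's own design and is not reproduced by this toy count.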

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection · Music and Audio Processing