DFALLM: Achieving Generalizable Multitask Deepfake Detection by Optimizing Audio LLM Components
Yupei Li, Li Wang, Yuxiang Wang, Lei Wang, Rizhao Cai, Jie Shi, Björn W. Schuller, Zhizheng Wu

TL;DR
This paper introduces a novel audio LLM (ALLM) architecture optimized for generalizable audio deepfake detection, achieving state-of-the-art results across multiple datasets and tasks by carefully selecting model components.
Contribution
It proposes a new ALLM structure that enhances generalization to out-of-domain deepfake detection and related tasks, addressing previous bottlenecks in audio LLM performance.
Findings
Achieves up to 95.76% accuracy on multiple datasets
Outperforms existing models in deepfake attribution and localization
Demonstrates the importance of component selection in ALLMs
Abstract
Audio deepfake detection has recently garnered public concern due to its implications for security and reliability. Traditional deep learning methods have been widely applied to this task but often lack generalisability when confronted with newly emerging spoofing techniques and with tasks beyond simple binary classification, such as spoof attribution recognition. In principle, Large Language Models (LLMs) are considered to possess the needed generalisation capabilities. However, previous research on Audio LLMs (ALLMs) indicates a generalisation bottleneck in audio deepfake detection performance, even when sufficient data is available. Consequently, this study investigates the model architecture and examines the effects of the primary components of ALLMs, namely the audio encoder and the text-based LLM. Our experiments demonstrate that the careful selection and combination of audio…
Taxonomy
Topics: Digital Media Forensic Detection · Speech Recognition and Synthesis · Generative Adversarial Networks and Image Synthesis
