Diffusion Reconstruction towards Generalizable Audio Deepfake Detection
Bo Cheng, Songjun Cao, Xiaoming Zhang, Jie Chen, Long Ma, Fei Chen

TL;DR
This paper introduces a diffusion-based hard sample generation framework with contrastive learning to improve generalization in Audio Deepfake Detection, effectively handling unseen attacks.
Contribution
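The paper proposes a diffusion-based reconstruction method for generating hard samples, combined with multi-layer feature aggregation and a Regularization-Assisted Contrastive Learning (RACL) objective, to improve robustness against unseen audio deepfake attacks.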
Findings
Significant reduction in average Equal Error Rate (EER) compared to the baseline.
Diffusion-based hard sample generation outperforms other reconstruction paradigms.
Enhanced generalization demonstrated through extensive experiments.
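The findings are reported in terms of Equal Error Rate (EER), the operating point at which the false acceptance rate (spoofed audio accepted as bona fide) equals the false rejection rate (bona fide audio rejected). The following is a minimal sketch of how EER can be estimated from detector scores; it is not the paper's evaluation code. The function name `compute_eer`, the score convention (higher means more likely bona fide), and the label encoding (1 for bona fide, 0 for spoofed) are assumptions made for illustration.

```python
def compute_eer(scores, labels):
    """Estimate the Equal Error Rate from detector scores.

    Assumed conventions (not taken from the paper):
      - a higher score means "more likely bona fide"
      - labels: 1 = bona fide, 0 = spoofed
    Returns (eer, threshold) at the threshold where the false acceptance
    rate and false rejection rate are closest to each other.
    """
    spoof = [s for s, y in zip(scores, labels) if y == 0]
    bona = [s for s, y in zip(scores, labels) if y == 1]
    if not spoof or not bona:
        raise ValueError("need at least one bona fide and one spoofed score")

    best = None
    for t in sorted(set(scores)):
        # A trial is accepted as bona fide when its score is >= t.
        far = sum(s >= t for s in spoof) / len(spoof)  # spoofed accepted
        frr = sum(s < t for s in bona) / len(bona)     # bona fide rejected
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]


# Toy scores: two spoofed trials followed by two bona fide trials.
eer, threshold = compute_eer([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

An "average EER" such as the one cited above would then be the mean of this quantity over several evaluation sets, under the assumption that each set is scored independently.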
Abstract
Achieving robust generalization against unseen attacks remains a challenge in Audio Deepfake Detection (ADD), driven by the rapid evolution of generative models. To address this, we propose a framework centered on hard sample classification. The core idea is that a model capable of distinguishing challenging hard samples is inherently equipped to handle simpler cases effectively. We investigate multiple reconstruction paradigms, identifying the diffusion-based method as optimal for generating hard samples. Furthermore, we leverage multi-layer feature aggregation and introduce a Regularization-Assisted Contrastive Learning (RACL) objective to enhance generalizability. Experiments demonstrate the superior generalization of our approach, with our best model achieving a significant reduction in the average Equal Error Rate (EER) compared to the baseline.
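The abstract mentions multi-layer feature aggregation without saying how the layers are combined. One common choice, shown here purely as an assumption, is a softmax-weighted sum over the per-layer outputs of a speech encoder, where the per-layer weights would normally be learned during training. The names `aggregate_layers` and `layer_logits` are hypothetical, and this sketch does not claim to reproduce the paper's aggregation scheme.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(layer_features, layer_logits):
    """Combine per-layer feature vectors into a single vector.

    layer_features: list of L vectors, each a list of D floats
                    (for example, one pooled embedding per encoder layer).
    layer_logits:   list of L unnormalized weights; learnable in practice,
                    fixed here for illustration.
    """
    if len(layer_features) != len(layer_logits):
        raise ValueError("one logit is required per layer")
    weights = softmax(layer_logits)
    dim = len(layer_features[0])
    # Weighted sum across layers, computed independently per dimension.
    return [
        sum(w * feat[d] for w, feat in zip(weights, layer_features))
        for d in range(dim)
    ]


# Two layers with equal logits reduce to a plain average of the layers.
fused = aggregate_layers([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
```

With learned logits, such a scheme lets the detector emphasize whichever encoder layers carry the most useful cues; whether the paper uses this or another aggregation is not stated in the abstract.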
