Retrieval-Augmented Audio Deepfake Detection
Zuheng Kang, Yayun He, Botao Zhao, Xiaoyang Qu, Junqing Peng, Jing Xiao, Jianzong Wang

TL;DR
This paper introduces a retrieval-augmented detection (RAD) framework for audio deepfake detection that augments each test sample with similar retrieved samples, achieving state-of-the-art results on the ASVspoof 2021 DF set.
Contribution
It proposes a novel retrieval-augmented detection framework combined with a multi-fusion attentive classifier for improved audio deepfake detection.
Findings
Achieves state-of-the-art results on the ASVspoof 2021 DF set.
Outperforms baseline methods on multiple datasets.
Retrieval improves detection by focusing on speaker-specific acoustic features.
Abstract
With recent advances in speech synthesis including text-to-speech (TTS) and voice conversion (VC) systems enabling the generation of ultra-realistic audio deepfakes, there is growing concern about their potential misuse. However, most deepfake (DF) detection methods rely solely on the fuzzy knowledge learned by a single model, resulting in performance bottlenecks and transparency issues. Inspired by retrieval-augmented generation (RAG), we propose a retrieval-augmented detection (RAD) framework that augments test samples with similar retrieved samples for enhanced detection. We also extend the multi-fusion attentive classifier to integrate it with our proposed RAD framework. Extensive experiments show the superior performance of the proposed RAD framework over baseline methods, achieving state-of-the-art results on the ASVspoof 2021 DF set and competitive results on the 2019 and 2021 LA…
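The retrieval step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each utterance has already been mapped to a fixed-size embedding, and it uses plain cosine similarity over a small in-memory array. The names `retrieve_top_k`, `augment_with_retrieved`, and `database` are hypothetical, and the embeddings are random placeholders rather than real audio features.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def retrieve_top_k(query, database, k=3):
    """Return indices and cosine similarities of the k database rows closest to the query."""
    q = l2_normalize(query)
    db = l2_normalize(database)
    sims = db @ q                    # cosine similarity to every stored embedding
    top = np.argsort(-sims)[:k]      # indices of the k most similar rows
    return top, sims[top]


def augment_with_retrieved(query, database, k=3):
    """Stack the query with its k nearest stored embeddings, shape (k + 1, dim)."""
    idx, _ = retrieve_top_k(query, database, k)
    return np.vstack([query[None, :], database[idx]])


# Placeholder data (assumption): 100 reference embeddings of dimension 16.
rng = np.random.default_rng(0)
database = rng.normal(size=(100, 16))

# A query that is a slightly perturbed copy of stored row 42.
query = database[42] + 0.01 * rng.normal(size=16)

idx, sims = retrieve_top_k(query, database, k=3)
augmented = augment_with_retrieved(query, database, k=3)
```

In a detection pipeline along these lines, the stacked array `augmented` would be passed to a downstream classifier so that the decision can draw on the retrieved neighbours as well as the test sample. How the paper's multi-fusion attentive classifier consumes the retrieved samples is not specified in this summary.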
Taxonomy
Methods: Sparse Evolutionary Training
