I Can Hear You: Selective Robust Training for Deepfake Audio Detection
Zirui Zhang, Wei Hao, Aroon Sankoh, William Lin, Emanuel Mendiola-Ortiz, Junfeng Yang, Chengzhi Mao

TL;DR
This paper introduces a large-scale deepfake audio dataset and proposes a frequency-selective adversarial training method to improve detection robustness against diverse attacks and corruptions.
Contribution
The paper presents the largest public deepfake voice dataset and a novel frequency-focused training method to enhance detection robustness.
Findings
Dataset boosts baseline detection performance by 33%.
Robust training improves accuracy by 7.7% on clean and 29.3% on attacked samples.
Frequency-based features are key to detection but vulnerable to manipulation.
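To make the last finding concrete, here is a minimal sketch of what restricting a perturbation to high frequencies can look like. It is not the authors' F-SAT implementation: the perturbation is random noise rather than a gradient-based attack, and the function names, the 16 kHz sample rate, and the 4 kHz cutoff are all assumptions chosen for illustration.

```python
import numpy as np


def high_freq_project(delta, sample_rate, cutoff_hz):
    """Zero every frequency component of `delta` below `cutoff_hz`."""
    spectrum = np.fft.rfft(delta)
    freqs = np.fft.rfftfreq(delta.shape[-1], d=1.0 / sample_rate)
    spectrum[..., freqs < cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=delta.shape[-1])


def band_energy(signal, sample_rate, lo_hz, hi_hz):
    """Sum of spectral power in the half-open band [lo_hz, hi_hz)."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(signal.shape[-1], d=1.0 / sample_rate)
    return float(power[..., (freqs >= lo_hz) & (freqs < hi_hz)].sum())


rng = np.random.default_rng(0)
sample_rate = 16_000   # assumed sample rate (Hz)
cutoff_hz = 4_000.0    # hypothetical cutoff, not a value from the paper

# Stand-in for an adversarial perturbation: one second of small random noise.
delta = 0.01 * rng.standard_normal(sample_rate)
delta_hf = high_freq_project(delta, sample_rate, cutoff_hz)

low = band_energy(delta_hf, sample_rate, 0.0, cutoff_hz)
high = band_energy(delta_hf, sample_rate, cutoff_hz, sample_rate / 2 + 1)
print(f"energy below cutoff: {low:.3e}, above cutoff: {high:.3e}")
```

After the projection, essentially all of the perturbation's energy sits above the cutoff. In an adversarial training loop, the same projection could be applied to each attack step so that the model is only hardened against changes in the selected band.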
Abstract
Recent advances in AI-generated voices have intensified the challenge of detecting deepfake audio, posing risks for scams and the spread of disinformation. To tackle this issue, we establish the largest public voice dataset to date, named DeepFakeVox-HQ, comprising 1.3 million samples, including 270,000 high-quality deepfake samples from 14 diverse sources. Despite previously reported high accuracy, existing deepfake voice detectors struggle with our diversely collected dataset, and their detection success rates drop even further under realistic corruptions and adversarial attacks. We conduct a holistic investigation into factors that enhance model robustness and show that incorporating a diversified set of voice augmentations is beneficial. Moreover, we find that the best detection models often rely on high-frequency features, which are imperceptible to humans and can be easily…
Peer Reviews
Decision·ICLR 2025 Poster
1. Paper is generally well-written and easy to read but some important details are missing 1. DeepFakeVox-HQ is a novel dataset containing data from prior datasets as well as novel deepfakes generated from SOTA speech synthesis models. I appreciate that the authors have curated a test set containing deepfake generation methods not covered in the training set _as well as deepfakes gathered from the internet_. I encourage the authors to consider uploading the dataset to a platform like Huggingfac
1. Some important details about the proposed approaches are not mentioned in the paper. 1. The value of $\epsilon$ and $p$ (or $q$) used in adversarial training methods should be mentioned in the main body of the paper. Currently, it is mentioned in the caption of a table in the appendix 1. The settings used for adversarial attacks during AT, F-SAT and evaluation need to be mentioned. 1. The parameters of the augmentations used in randaugment need to be mentioned at least in the ap
1. DeepFakeVox-HQ stands out as a substantial addition to the field, with over 1.3 million samples, including 270,000 high-quality deepfake samples from 14 sources. This dataset addresses the limitations of existing datasets in diversity and scale, making it a valuable resource for benchmarking future detection models. Releasing this dataset would have a broad impact on the community. 2. The F-SAT method is an important innovation, targeting high-frequency features that are critical for detection
1. The paper does not specify whether baseline models were subjected to adversarial training. If only the F-SAT model received this enhancement, it could bias the results. Including adversarially-trained versions of baseline models using contemporary adversarial methods would provide a fairer comparison and highlight F-SAT’s unique advantages. 2. While F-SAT’s focus on high-frequency components is intriguing, the rationale behind the reliance on high frequencies for detecting deepfake audio cou
This work makes three main contributions: (1) a carefully organized dataset, (2) a deepfake detection method, and (3) robustness against adversarial attacks (in a setting that focuses on high-frequency signals). In general, the contributions of this work are multi-fold.
My major concern is whether the contributions (or advantages) of this work are over-claimed. Regarding the dataset, although it is well organized and processed, the samples are generated using existing approaches; thus, "the largest" is not a significant contribution. Regarding generalization, as in Table 2, the significantly superior results of the proposed method are achieved on the self-organized dataset, DeepFakeVox-HQ. However, as the author introduced in Section 3, there are overlapped sy
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing
Methods: Sparse Evolutionary Training
