I Can Hear You: Selective Robust Training for Deepfake Audio Detection

Zirui Zhang; Wei Hao; Aroon Sankoh; William Lin; Emanuel Mendiola-Ortiz; Junfeng Yang; Chengzhi Mao

arXiv:2411.00121·cs.SD·November 4, 2024·3 cites

TL;DR

This paper introduces a large-scale deepfake audio dataset and proposes a frequency-selective adversarial training method to improve detection robustness against diverse attacks and corruptions.

Contribution

The paper presents the largest public deepfake voice dataset and a novel frequency-focused training method to enhance detection robustness.

Findings

1. Dataset boosts baseline detection performance by 33%.
2. Robust training improves accuracy by 7.7% on clean and 29.3% on attacked samples.
3. Frequency-based features are key to detection but vulnerable to manipulation.

Abstract

Recent advances in AI-generated voices have intensified the challenge of detecting deepfake audio, posing risks for scams and the spread of disinformation. To tackle this issue, we establish the largest public voice dataset to date, named DeepFakeVox-HQ, comprising 1.3 million samples, including 270,000 high-quality deepfake samples from 14 diverse sources. Despite previously reported high accuracy, existing deepfake voice detectors struggle with our diversely collected dataset, and their detection success rates drop even further under realistic corruptions and adversarial attacks. We conduct a holistic investigation into factors that enhance model robustness and show that incorporating a diversified set of voice augmentations is beneficial. Moreover, we find that the best detection models often rely on high-frequency features, which are imperceptible to humans and can be easily…
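The abstract's point about high-frequency features can be made concrete with a small sketch of a perturbation confined to the high-frequency band. This is not the paper's F-SAT procedure, whose settings are not given on this page; it only illustrates the general idea of masking a perturbation in the frequency domain. The function names, the cutoff `cutoff_hz`, the budget `epsilon`, and the use of random noise (rather than a gradient-based attack) are all assumptions made for illustration. A plain O(n²) DFT keeps the example dependency-free; in practice an FFT routine such as `numpy.fft` would be the usual choice.

```python
import cmath
import math
import random


def dft(x):
    """Naive discrete Fourier transform (O(n^2)), fine for short signals."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse DFT; returns the real part of each reconstructed sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def high_freq_only(signal, sample_rate, cutoff_hz):
    """Zero every frequency bin below cutoff_hz (hypothetical cutoff)."""
    n = len(signal)
    spectrum = dft(signal)
    for k in range(n):
        # Bins k and n-k both represent the same physical frequency.
        freq = min(k, n - k) * sample_rate / n
        if freq < cutoff_hz:
            spectrum[k] = 0
    return idft(spectrum)


def perturb(waveform, sample_rate, cutoff_hz, epsilon, seed=0):
    """Add random noise restricted to frequencies >= cutoff_hz.

    The noise is rescaled so its largest sample has magnitude epsilon
    (an L-infinity budget chosen here purely for illustration).
    """
    rng = random.Random(seed)
    noise = [rng.uniform(-1.0, 1.0) for _ in waveform]
    delta = high_freq_only(noise, sample_rate, cutoff_hz)
    peak = max(abs(d) for d in delta) or 1.0  # avoid dividing by zero
    return [w + epsilon * d / peak for w, d in zip(waveform, delta)]


if __name__ == "__main__":
    sample_rate = 1600
    wave = [math.sin(2 * math.pi * 100 * t / sample_rate) for t in range(64)]
    attacked = perturb(wave, sample_rate, cutoff_hz=400.0, epsilon=0.01)
    print(max(abs(a - w) for a, w in zip(attacked, wave)))
```

A robust-training loop in this spirit would feed such band-limited perturbed waveforms to the detector alongside clean ones; how the perturbation is actually chosen and which band is targeted would follow the paper, not this sketch.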

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

1. Paper is generally well-written and easy to read but some important details are missing.
2. DeepFakeVox-HQ is a novel dataset containing data from prior datasets as well as novel deepfakes generated from SOTA speech synthesis models. I appreciate that the authors have curated a test set containing deepfake generation methods not covered in the training set _as well as deepfakes gathered from the internet_. I encourage the authors to consider uploading the dataset to a platform like Huggingfac

Weaknesses

1. Some important details about the proposed approaches are not mentioned in the paper.
   1. The value of $\epsilon$ and $p$ (or $q$) used in adversarial training methods should be mentioned in the main body of the paper. Currently, it is mentioned in the caption of a table in the appendix.
   2. The settings used for adversarial attacks during AT, F-SAT and evaluation need to be mentioned.
   3. The parameters of the augmentations used in randaugment need to be mentioned at least in the ap

Reviewer 02 · Rating 8 · Confidence 5

Strengths

1. DeepFakeVox-HQ stands out as a substantial addition to the field, with over 1.3 million samples, including 270,000 high-quality deepfake samples from 14 sources. This dataset addresses the limitations of existing datasets in diversity and scale, making it a valuable resource for benchmarking future detection models. Releasing this dataset would have a broad impact on the community.
2. The F-SAT method is an important innovation, targeting high-frequency features that are critical for detection

Weaknesses

1. The paper does not specify whether baseline models were subjected to adversarial training. If only the F-SAT model received this enhancement, it could bias the results. Including adversarially-trained versions of baseline models using contemporary adversarial methods would provide a fairer comparison and highlight F-SAT’s unique advantages.
2. While F-SAT’s focus on high-frequency components is intriguing, the rationale behind the reliance on high frequencies for detecting deepfake audio cou

Reviewer 03 · Rating 6 · Confidence 3

Strengths

This work makes three main contributions: (1) a carefully organized dataset, (2) a deepfake detection method, and (3) robustness against adversarial attacks (with the setting focusing on high-frequency signals). In general, the contributions of this work are multi-fold.

Weaknesses

My major concern is whether the contributions (or advantages) of this work are over-claimed. Regarding the dataset, although it is well organized and processed, the samples are generated using existing approaches; thus, being "the largest" is not a significant contribution. Regarding generalization, as shown in Table 2, the significantly superior results of the proposed method are achieved on the self-organized dataset, DeepFakeVox-HQ. However, as the authors introduced in Section 3, there are overlapped sy

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing

Methods: Sparse Evolutionary Training