Can DeepFake Speech be Reliably Detected?
Hongbin Liu, Youzheng Chen, Arun Narayanan, Athula Balachandran, Pedro J. Moreno, Lun Wang

TL;DR
This paper systematically studies the vulnerabilities of current synthetic speech detectors against active malicious attacks, revealing the need for more robust detection methods amid evolving threats.
Contribution
It provides the first comprehensive analysis of active attacks on open-source synthetic speech detectors (SSDs), covering white-box and black-box methods, and evaluates the attacks' effectiveness and stealthiness.
Findings
Current SSDs are vulnerable to active attacks.
Active attacks can deceive detectors with high success rates.
Robust detection methods are urgently needed.
Abstract
Recent advances in text-to-speech (TTS) systems, particularly those with voice cloning capabilities, have made voice impersonation readily accessible, raising ethical and legal concerns due to potential misuse for malicious activities like misinformation campaigns and fraud. While synthetic speech detectors (SSDs) exist to combat this, they are vulnerable to "test domain shift", exhibiting decreased performance when audio is altered through transcoding, playback, or background noise. This vulnerability is further exacerbated by deliberate manipulation of synthetic speech aimed at deceiving detectors. This work presents the first systematic study of such active malicious attacks against state-of-the-art open-source SSDs. White-box attacks, black-box attacks, and their transferability are studied from both attack effectiveness and stealthiness, using both hardcoded metrics and human…
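To make the idea of a white-box attack concrete, the following is a minimal sketch of an iterative sign-gradient attack with projection (in the style of PGD, which one of the reviews below names as a method used in the paper). It is not the authors' implementation: the detector here is a toy linear model over a hypothetical 64-dimensional feature vector, standing in for a neural SSD operating on waveforms, and every name and parameter value (`detector_score`, `pgd_attack`, `eps`, `alpha`, `steps`) is an assumption chosen for illustration.

```python
import math
import random


def detector_score(w, b, x):
    """Toy linear detector (assumed): probability that x is synthetic."""
    logit = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-logit))


def score_gradient(w, b, x):
    """Gradient of the detector score with respect to the input x."""
    s = detector_score(w, b, x)
    return [s * (1.0 - s) * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def pgd_attack(w, b, x0, eps=0.05, alpha=0.01, steps=20):
    """Lower the 'synthetic' score while staying within eps of x0 (L-infinity)."""
    x = list(x0)
    for _ in range(steps):
        g = score_gradient(w, b, x)
        # Step against the gradient sign to reduce the detector's score.
        x = [xi - alpha * sign(gi) for xi, gi in zip(x, g)]
        # Project back into the eps-ball around the original input.
        x = [min(max(xi, x0i - eps), x0i + eps) for xi, x0i in zip(x, x0)]
    return x


random.seed(0)
dim = 64  # hypothetical feature dimension
w = [random.uniform(-1.0, 1.0) for _ in range(dim)]
b = 0.0
# An input the toy detector flags as synthetic (score above 0.5).
x0 = [0.02 * sign(wi) for wi in w]
x_adv = pgd_attack(w, b, x0)

print("score before attack:", detector_score(w, b, x0))
print("score after attack: ", detector_score(w, b, x_adv))
```

The attack needs gradient access, which is what makes it white-box; the `eps` bound is a crude stand-in for the stealthiness constraint, which the paper instead assesses with objective metrics and human ratings.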
Peer Reviews
Decision·ICLR 2025 Conference Withdrawn Submission
- The manuscript is not too difficult to follow.
1. Lack of Novelty: The challenges related to deepfake audio detection, particularly the decline in performance across domains, are well-established in the literature. The vulnerability of detection models to adversarial attacks is also widely recognized. Therefore, dedicating an entire paper to simply illustrating these issues does not contribute new insights to the field. 2. No Technical Contribution: The paper primarily applies classic adversarial attack methods to examine the vulnerability of
1. The paper covers a wide range of attack scenarios (white-box, black-box, and agnostic), providing a holistic view of SSD vulnerabilities under varying levels of attacker access. 2. The study employs both objective metrics (e.g., VisQOL scores) and subjective human ratings, ensuring that attack success considers not only detectability but also audio quality. This points to a real threat: perturbed audio can currently bypass the SSD easily without loss of audio quality. 3. Given the r
1. The paper evaluates the robustness of only four SSDs against adversarial attacks, which may limit the overall scope and depth of its findings. More advanced SSDs should be incorporated. 2. The evaluation may not be sound. For example, in Section 3.2, the paper claims that "Deepfake speech from TTS not seen during training is more likely to bypass SSDs." However, it does not provide the benign detection performance of the SSDs across different datasets, which is crucial as existing SSDs often
- The problem of unreliable deepfake detectors is of primary concern to users, such as fact-checkers and investigators, who want to rely on their performance out-of-the-box. The paper hits at the core problem with deepfake detectors.
- The paper is overall clear to read and provides several results to ground its insights.
- The presentation of takeaways and key findings makes the paper accessible for readers.
As this paper asks an evaluation question, **it needs to back it with substantive evaluations**. I suggest the following improvements: - The complete evaluation is performed on 100 samples from each of 3 text-to-speech (TTS) datasets, essentially basing all analysis on 300 samples. There are newer speech-to-speech datasets, such as DECRO [1]. Moreover, audio deepfake datasets are easy to create. If we have 100 speakers then we can create 100 * 99 (9,900) by inferring a speech generator, such as FREEVC,
**Comprehensive Evaluation:** One strength of this work is its thorough evaluation. The authors test all three types of threat models (white-box, black-box, and agnostic) against various deepfake detectors. **Human Evaluation:** Additionally, it is valuable that the authors assess the quality of deepfake audio through human studies, showing that the attacks do not produce discernible changes to the audio.
**Vulnerability Already Well-Known:** Deep learning-based audio models are already known to be susceptible to evasion attacks [1], with extensive literature on optimization-based and signal processing-based attacks. In fact, the signal processing family of attacks might be even better suited than optimization-based ones (like PGD and iFGSM as used in the paper) since they do not even require query access to the model. At this stage, the community would benefit more from methods to enhance robus
Taxonomy
Topics: Speech Recognition and Synthesis
