Harder or Different? Understanding Generalization of Audio Deepfake Detection

Nicolas M. Müller; Nicholas Evans; Hemlata Tak; Philip Sperl; Konstantin Böttinger

arXiv:2406.03512·cs.SD·June 13, 2024

Open Access

TL;DR

This paper investigates why audio deepfake detection models generalize poorly to deepfakes they were not trained on. It finds that the main obstacle is that deepfakes produced by different generation models differ from one another, not that newer deepfakes are harder to detect.

Contribution

It decomposes the performance gap between in-domain and out-of-domain test data into 'hardness' and 'difference' components, showing that differences between deepfake generation models, rather than increased hardness, are the primary obstacle to generalization.
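
The idea of splitting a gap into two components can be illustrated with a small sketch. This is not the paper's formula, which this page does not state: the additive split, the use of a detector trained and tested on the out-of-domain data as the hardness reference, the function and field names, and the numbers in the example call are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GapDecomposition:
    """Illustrative split of a generalization gap (hypothetical structure)."""
    total_gap: float
    hardness: float
    difference: float


def decompose_gap(err_in_domain: float,
                  err_cross_domain: float,
                  err_matched_ood: float) -> GapDecomposition:
    """Split the in-domain vs. out-of-domain gap into two additive parts.

    Assumed inputs (error rates such as EER, as fractions):
      err_in_domain    : train on A, test on A
      err_cross_domain : train on A, test on B
      err_matched_ood  : train on B, test on B (assumed hardness reference)
    """
    # Total gap: how much worse the A-trained detector is on B than on A.
    total_gap = err_cross_domain - err_in_domain
    # Hardness (assumption): how much harder B is even with matched training.
    hardness = err_matched_ood - err_in_domain
    # Difference (assumption): the remainder, attributed to train/test mismatch.
    difference = err_cross_domain - err_matched_ood
    return GapDecomposition(total_gap, hardness, difference)


# Hypothetical error rates, not results from the paper.
example = decompose_gap(err_in_domain=0.02,
                        err_cross_domain=0.20,
                        err_matched_ood=0.03)
print(example)
```

Under this assumed split, a small `hardness` next to a large `difference` would correspond to the pattern the paper reports: the out-of-domain data is not much harder in itself, but it differs from what the detector saw during training.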

Findings

01 The performance gap is mainly due to differences between deepfake generation models

02 The hardness component of the gap is practically negligible on the ASVspoof databases

03 Merely increasing detector model capacity may not improve generalization

Abstract

Recent research has highlighted a key issue in speech deepfake detection: models trained on one set of deepfakes perform poorly on others. The question arises: is this due to the continuously improving quality of Text-to-Speech (TTS) models, i.e., are newer DeepFakes just 'harder' to detect? Or, is it because deepfakes generated with one model are fundamentally different to those generated using another model? We answer this question by decomposing the performance gap between in-domain and out-of-domain test data into 'hardness' and 'difference' components. Experiments performed using ASVspoof databases indicate that the hardness component is practically negligible, with the performance gap being attributed primarily to the difference component. This has direct implications for real-world deepfake detection, highlighting that merely increasing model capacity, the currently-dominant…

Taxonomy

Topics: Music and Audio Processing · Digital Media Forensic Detection

Methods: Sparse Evolutionary Training