Unmasking Deepfakes: Leveraging Augmentations and Features Variability for Deepfake Speech Detection
Inbal Rimon, Oren Gal, Haim Permuter

TL;DR
This paper introduces a hybrid deepfake speech detection framework using novel spectrogram and feature masking augmentations, combined with compression-aware self-supervised learning, achieving state-of-the-art results on multiple benchmarks.
Contribution
It proposes a dual-stage masking approach and a compression-aware training strategy within a unified model for improved deepfake speech detection.
Findings
Achieved 4.08% EER on ASVspoof5 Challenge (Track 1)
Obtained 0.18% EER on ASVspoof2019 evaluation set
Reached 2.92% EER on ASVspoof2021 DF task
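For readers unfamiliar with the metric, the equal error rate (EER) reported above is the operating point at which the false acceptance rate and the false rejection rate coincide. The sketch below is one simple way to estimate it from detector scores. It assumes that higher scores mean "more likely bona fide"; the function name `compute_eer` and the threshold sweep are illustrative and are not taken from the paper or from the official ASVspoof scoring tools.

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Estimate the EER, assuming a higher score means 'more likely bona fide'."""
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)

    # Use every observed score as a candidate decision threshold.
    thresholds = np.unique(np.concatenate([bonafide, spoof]))

    # False rejection rate: bona fide trials scored below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # False acceptance rate: spoofed trials scored at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])

    # Pick the threshold where the two error rates are closest and average them.
    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2.0)

# Hypothetical scores, for illustration only.
eer = compute_eer([0.9, 0.8, 0.2, 0.6], [0.1, 0.3, 0.7, 0.4])
print(f"EER = {100 * eer:.2f}%")
```

With finite data the two error curves rarely cross exactly at an observed threshold, so evaluation toolkits typically interpolate; the averaging step here is a coarse substitute for that.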
Abstract
Deepfake speech detection presents a growing challenge as generative audio technologies continue to advance. We propose a hybrid training framework that advances detection performance through novel augmentation strategies. First, we introduce a dual-stage masking approach that operates both at the spectrogram level (MaskedSpec) and within the latent feature space (MaskedFeature), providing complementary regularization that improves tolerance to localized distortions and enhances generalization. Second, we introduce a compression-aware strategy during self-supervised learning to increase variability in low-resource scenarios while preserving the integrity of learned representations, thereby improving the suitability of pretrained features for deepfake detection. The framework integrates a learnable self-supervised feature extractor with a ResNet classification head in a unified training…
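As a rough illustration of the dual-stage masking idea in the abstract, the sketch below applies SpecAugment-style band masking at two points: once on a spectrogram and once on a latent feature matrix. The abstract does not say how MaskedSpec or MaskedFeature choose their masks, so the function names, array shapes, maximum mask widths, and zero fill value are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def mask_band(x, axis, max_width, rng, fill=0.0):
    """Return a copy of x with one random contiguous band along `axis` set to `fill`."""
    out = x.copy()
    size = x.shape[axis]
    # Band width is drawn from [0, min(max_width, size)], so a band may be empty.
    width = int(rng.integers(0, min(max_width, size) + 1))
    start = int(rng.integers(0, size - width + 1))
    index = [slice(None)] * x.ndim
    index[axis] = slice(start, start + width)
    out[tuple(index)] = fill
    return out

def masked_spec(spec, rng, max_freq=8, max_time=20):
    """Input-level masking on a (freq_bins, frames) spectrogram (assumed layout)."""
    spec = mask_band(spec, axis=0, max_width=max_freq, rng=rng)
    return mask_band(spec, axis=1, max_width=max_time, rng=rng)

def masked_feature(feats, rng, max_time=10, max_dim=32):
    """Latent-level masking on a (frames, dim) feature matrix (assumed layout)."""
    feats = mask_band(feats, axis=0, max_width=max_time, rng=rng)
    return mask_band(feats, axis=1, max_width=max_dim, rng=rng)

# Hypothetical shapes: 80 frequency bins x 100 frames, then 50 frames x 768 dims.
rng = np.random.default_rng(0)
spec_aug = masked_spec(np.ones((80, 100)), rng)
feat_aug = masked_feature(np.ones((50, 768)), rng)
print(spec_aug.shape, feat_aug.shape)
```

In a training loop, masking of this kind would normally be applied only during training, with the first function acting before the feature extractor and the second acting on its output before the classification head.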
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
