A SUPERB-Style Benchmark of Self-Supervised Speech Models for Audio Deepfake Detection
Hashim Ali, Nithin Sai Adupa, Surya Subramani, Hafiz Malik

TL;DR
This paper introduces Spoof-SUPERB, a benchmark that evaluates 20 self-supervised speech models for audio deepfake detection. It finds that large-scale discriminative models outperform the others and are more robust under acoustic degradations.
Contribution
It systematically evaluates SSL models for audio deepfake detection, establishing a reproducible benchmark and analyzing model robustness and reliability.
Findings
Large-scale discriminative models such as XLS-R, UniSpeech-SAT, and WavLM Large consistently outperform other models.
Large-scale models benefit from multilingual pretraining and speaker-aware objectives.
Discriminative models are more resilient to acoustic degradations.
Abstract
Self-supervised learning (SSL) has transformed speech processing, with benchmarks such as SUPERB establishing fair comparisons across diverse downstream tasks. Despite its security-critical importance, audio deepfake detection has remained outside these efforts. In this work, we introduce Spoof-SUPERB, a benchmark for audio deepfake detection that systematically evaluates 20 SSL models spanning generative, discriminative, and spectrogram-based architectures. We evaluate these models on multiple in-domain and out-of-domain datasets. Our results reveal that large-scale discriminative models such as XLS-R, UniSpeech-SAT, and WavLM Large consistently outperform other models, benefiting from multilingual pretraining, speaker-aware objectives, and model scale. We further analyze the robustness of these models under acoustic degradations, showing that generative approaches degrade sharply,…
