Do Compact SSL Backbones Matter for Audio Deepfake Detection? A Controlled Study with RAPTOR
Ajinkya Kulkarni, Sandipana Dowerah, Atharva Kulkarni, Tanel Alumäe, Mathew Magimai Doss

TL;DR
This study evaluates the impact of compact SSL backbones like HuBERT and WavLM on audio deepfake detection, revealing that pre-training data and trajectory are more crucial than model size for robustness and calibration.
Contribution
The paper introduces RAPTOR, a unified framework for comparing compact SSL models, and demonstrates that pre-training strategy, not size, determines detection robustness across domains.
Findings
Multilingual HuBERT pre-training enhances cross-domain robustness.
WavLM variants tend to be overconfident and miscalibrated under perturbations.
Model scale is less important than pre-training trajectory for detection reliability.
Abstract
Self-supervised learning (SSL) underpins modern audio deepfake detection, yet most prior work centers on a single large wav2vec2-XLSR backbone, leaving compact backbones understudied. We present RAPTOR (Representation Aware Pairwise-gated Transformer for Out-of-domain Recognition), a controlled study of compact SSL backbones from the HuBERT and WavLM families within a unified pairwise-gated fusion detector, evaluated across 14 cross-domain benchmarks. We show that multilingual HuBERT pre-training is the primary driver of cross-domain robustness, enabling 100M models to match larger and commercial systems. Beyond EER, we introduce a test-time augmentation protocol with perturbation-based aleatoric uncertainty to expose calibration differences invisible to standard metrics: WavLM variants exhibit overconfident miscalibration under perturbation, whereas iterative mHuBERT remains stable. These findings…
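The abstract does not spell out how the test-time augmentation protocol is computed, so the following is only a minimal sketch of one common way to turn perturbed predictions into an uncertainty estimate. Everything in it is an assumption made for illustration, not the authors' method: the additive-noise perturbation, the SNR levels, the names `add_noise` and `tta_uncertainty`, and the use of mean per-perturbation entropy as a proxy for aleatoric uncertainty. The detector is passed in as any callable that maps a waveform to P(fake).

```python
import math
import random
from typing import Callable, Dict, List, Sequence


def binary_entropy(p: float) -> float:
    """Entropy in nats of a Bernoulli distribution with P(fake) = p."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def add_noise(wave: Sequence[float], snr_db: float, rng: random.Random) -> List[float]:
    """Hypothetical perturbation: additive Gaussian noise at a target SNR (dB).

    Assumes a non-empty waveform.
    """
    power = sum(x * x for x in wave) / len(wave)
    noise_std = math.sqrt(power / (10.0 ** (snr_db / 10.0)))
    return [x + rng.gauss(0.0, noise_std) for x in wave]


def tta_uncertainty(
    score_fn: Callable[[Sequence[float]], float],
    wave: Sequence[float],
    snrs_db: Sequence[float] = (30.0, 20.0, 10.0),  # assumed levels, not from the paper
    seed: int = 0,
) -> Dict[str, float]:
    """Score several perturbed copies of one utterance and summarize them.

    mean_prob        : average P(fake) over the perturbed copies
    expected_entropy : mean per-copy entropy (assumed aleatoric proxy)
    spread           : standard deviation of P(fake) across the copies
    """
    rng = random.Random(seed)
    probs = [score_fn(add_noise(wave, snr, rng)) for snr in snrs_db]
    mean_p = sum(probs) / len(probs)
    expected_entropy = sum(binary_entropy(p) for p in probs) / len(probs)
    spread = math.sqrt(sum((p - mean_p) ** 2 for p in probs) / len(probs))
    return {"mean_prob": mean_p, "expected_entropy": expected_entropy, "spread": spread}


if __name__ == "__main__":
    # Toy stand-in detector: P(fake) rises with mean absolute amplitude.
    def toy_detector(w: Sequence[float]) -> float:
        level = sum(abs(x) for x in w) / len(w)
        return 1.0 / (1.0 + math.exp(-10.0 * (level - 0.3)))

    utterance = [0.4 * math.sin(0.05 * n) for n in range(1600)]
    print(tta_uncertainty(toy_detector, utterance))
```

Under these assumptions, a detector that keeps reporting near-zero entropy while its accuracy drops on the perturbed copies would look overconfident in the sense the abstract describes, a difference that a single clean-condition EER would not show.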
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection · Speech and Audio Processing
