Do Compact SSL Backbones Matter for Audio Deepfake Detection? A Controlled Study with RAPTOR

Ajinkya Kulkarni; Sandipana Dowerah; Atharva Kulkarni; Tanel Alumäe; Mathew Magimai Doss

arXiv:2603.06164·cs.SD·March 9, 2026

TL;DR

This study evaluates the impact of compact SSL backbones like HuBERT and WavLM on audio deepfake detection, revealing that pre-training data and trajectory are more crucial than model size for robustness and calibration.

Contribution

The paper introduces RAPTOR, a unified framework for comparing compact SSL models, and demonstrates that pre-training strategy, not size, determines detection robustness across domains.

Findings

01. Multilingual HuBERT pre-training enhances cross-domain robustness.

02. WavLM variants tend to be overconfident and miscalibrated under perturbations.

03. Model scale is less important than pre-training trajectory for detection reliability.

Abstract

Self-supervised learning (SSL) underpins modern audio deepfake detection, yet most prior work centers on a single large wav2vec2-XLSR backbone, leaving compact backbones understudied. We present RAPTOR (Representation Aware Pairwise-gated Transformer for Out-of-domain Recognition), a controlled study of compact SSL backbones from the HuBERT and WavLM families within a unified pairwise-gated fusion detector, evaluated across 14 cross-domain benchmarks. We show that multilingual HuBERT pre-training is the primary driver of cross-domain robustness, enabling 100M-parameter models to match larger and commercial systems. Beyond equal error rate (EER), we introduce a test-time augmentation protocol with perturbation-based aleatoric uncertainty to expose calibration differences invisible to standard metrics: WavLM variants exhibit overconfident miscalibration under perturbation, whereas iterative mHuBERT remains stable. These findings…
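To make the two evaluation ideas in the abstract concrete, here is a minimal Python sketch. It is not the paper's protocol, which the abstract does not specify. The sketch assumes scores are spoof probabilities in [0, 1], that perturbations are additive Gaussian noise, and that aleatoric uncertainty is estimated as the mean binary entropy over perturbed copies of an utterance. The names `equal_error_rate`, `tta_uncertainty`, `score_fn`, `n_views`, and `noise_std` are hypothetical.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """EER for scores where higher means 'more likely spoof'.

    labels: 1 = spoof, 0 = bona fide (an assumed convention).
    Scans every observed score as a threshold and returns the error
    rate where false positive and false negative rates are closest.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        pred = scores >= t
        fpr = np.mean(pred[labels == 0])   # bona fide flagged as spoof
        fnr = np.mean(~pred[labels == 1])  # spoof passed as bona fide
        gap = abs(fpr - fnr)
        if gap < best_gap:
            best_gap, eer = gap, (fpr + fnr) / 2
    return float(eer)


def tta_uncertainty(score_fn, waveform, n_views=8, noise_std=0.005, seed=0):
    """Test-time augmentation with a perturbation-based uncertainty estimate.

    score_fn: hypothetical callable mapping a 1-D waveform to P(spoof).
    Returns (mean probability, mean binary entropy) over noisy copies.
    Mean entropy is one common aleatoric estimate, assumed here.
    """
    rng = np.random.default_rng(seed)
    waveform = np.asarray(waveform, dtype=float)
    probs = np.array([
        score_fn(waveform + rng.normal(0.0, noise_std, size=waveform.shape))
        for _ in range(n_views)
    ])
    probs = np.clip(probs, 1e-7, 1 - 1e-7)  # avoid log(0)
    entropy = -(probs * np.log(probs) + (1 - probs) * np.log(1 - probs))
    return float(probs.mean()), float(entropy.mean())
```

Under these assumptions, a detector that keeps emitting near-certain probabilities on perturbed inputs while its EER degrades would show low entropy alongside high error, which is the kind of overconfident miscalibration that EER alone cannot reveal.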

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection · Speech and Audio Processing