Comprehensive Layer-wise Analysis of SSL Models for Audio Deepfake Detection
Yassine El Kheir, Youness Samih, Suraj Maharjan, Tim Polzehl, and Sebastian Möller

TL;DR
This study provides a detailed layer-wise analysis of SSL models for audio deepfake detection, revealing that lower transformer layers are most effective and enabling reduced computational costs without sacrificing performance.
Contribution
It offers the first comprehensive layer-wise analysis of SSL models in audio deepfake detection, highlighting the importance of lower layers and demonstrating efficient model configurations.
Findings
Lower layers are most discriminative for deepfake detection.
Models maintain competitive EER scores with fewer layers.
Using only a few lower layers reduces computational costs.
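To make the idea of "using only a few lower layers" concrete, here is a minimal sketch, not the authors' implementation. It assumes the SSL encoder's per-layer hidden states are already available as an array of shape (layers, frames, features) and that the kept layers are combined by a softmax-weighted sum. The function name `pool_lower_layers` and the `layer_logits` parameter are hypothetical and introduced only for illustration.

```python
import numpy as np

def pool_lower_layers(hidden_states, num_layers, layer_logits=None):
    """Combine the first `num_layers` layer outputs into one feature sequence.

    hidden_states: array of shape (L, T, D), one (T, D) matrix per layer,
                   ordered from the lowest layer to the highest.
    num_layers:    how many of the lowest layers to keep (1 <= K <= L).
    layer_logits:  optional per-layer scores; a softmax over them gives the
                   mixing weights. Defaults to uniform weights (assumption).
    """
    hidden_states = np.asarray(hidden_states, dtype=float)
    if not 1 <= num_layers <= hidden_states.shape[0]:
        raise ValueError("num_layers must be between 1 and the layer count")

    kept = hidden_states[:num_layers]            # (K, T, D): lower layers only
    if layer_logits is None:
        layer_logits = np.zeros(num_layers)      # uniform weighting
    logits = np.asarray(layer_logits, dtype=float)[:num_layers]

    # Numerically stable softmax over the kept layers.
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()

    # Weighted sum over the layer axis -> (T, D).
    return np.tensordot(weights, kept, axes=1)
```

This sketch only selects and mixes features. The compute saving described in the findings would come from not running the discarded upper layers at all, which depends on the specific model implementation.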
Abstract
This paper conducts a comprehensive layer-wise analysis of self-supervised learning (SSL) models for audio deepfake detection across diverse contexts, including multilingual datasets (English, Chinese, Spanish), partial, song, and scene-based deepfake scenarios. By systematically evaluating the contributions of different transformer layers, we uncover critical insights into model behavior and performance. Our findings reveal that lower layers consistently provide the most discriminative features, while higher layers capture less relevant information. Notably, all models achieve competitive equal error rate (EER) scores even when employing a reduced number of layers. This indicates that we can reduce computational costs and increase the inference speed of detecting deepfakes by utilizing only a few lower layers. This work enhances our understanding of SSL models in deepfake detection,…
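For readers unfamiliar with the metric, the equal error rate is the operating point at which the false acceptance rate (spoofed audio accepted as genuine) equals the false rejection rate (genuine audio rejected). The following is a minimal sketch of one common way to estimate it from detector scores. It assumes that higher scores mean "more likely bona fide" and that labels are 1 for bona fide and 0 for spoof. The helper name `compute_eer` is hypothetical, and the paper's own evaluation code may differ.

```python
import numpy as np

def compute_eer(scores, labels):
    """Estimate the equal error rate from detection scores.

    scores: higher value = more likely bona fide (assumed convention).
    labels: 1 for bona fide, 0 for spoof.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    bona_fide = scores[labels == 1]
    spoof = scores[labels == 0]

    # Candidate thresholds: every observed score, plus one above the maximum
    # so that the "reject everything" operating point is also covered.
    thresholds = np.unique(scores)
    thresholds = np.concatenate([thresholds, [thresholds[-1] + 1.0]])

    # A trial is accepted as bona fide when its score is >= threshold.
    frr = np.array([(bona_fide < t).mean() for t in thresholds])  # false rejections
    far = np.array([(spoof >= t).mean() for t in thresholds])     # false acceptances

    # EER is taken where the two error rates are closest.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0
```

With a finite test set the two rates rarely cross exactly, so the sketch averages them at the closest threshold. Interpolation-based estimators are also common and can give slightly different values.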
Taxonomy
Topics: Digital Media Forensic Detection · Handwritten Text Recognition Techniques · Speech Recognition and Synthesis
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
