Exploring WavLM Back-ends for Speech Spoofing and Deepfake Detection

Theophile Stourbe; Victor Miara; Theo Lepage; Reda Dehak

arXiv:2409.05032·eess.AS·June 25, 2025

Open Access

TL;DR

This paper investigates pre-trained WavLM models combined with various back-end techniques for detecting speech deepfakes, reaching 3.42% EER with the help of data augmentation and system fusion.

Contribution

It introduces a novel approach combining WavLM representations with back-end methods and data augmentation for improved deepfake detection performance.

Findings

01. Achieved 3.42% EER in deepfake detection

02. Utilized data augmentation with noise and reverberation

03. Enhanced performance through system fusion and calibration
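
For readers unfamiliar with the metric in the first finding, the equal error rate (EER) is the operating point where the false acceptance rate on spoofed speech equals the false rejection rate on bonafide speech. The following is a minimal sketch of that definition, not the challenge's official scoring code. The function name `compute_eer`, the convention that higher scores mean "more bonafide", and the toy scores are all assumptions made for illustration.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Approximate the EER by scanning every observed score as a threshold.

    Assumption: a higher score means the system believes the utterance
    is more likely bonafide. Returns (eer, threshold).
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection rate: bonafide utterances scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoofed utterances scored at or above it.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores (hypothetical, not taken from the paper).
bonafide = [0.9, 0.8, 0.7, 0.4]
spoof = [0.6, 0.3, 0.2, 0.1]
eer, threshold = compute_eer(bonafide, spoof)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

With only a handful of scores the two error rates may never be exactly equal, so the sketch reports their average at the closest point; evaluation toolkits typically interpolate along the error curve instead.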

Abstract

This paper describes our submitted systems to the ASVspoof 5 Challenge Track 1: Speech Deepfake Detection - Open Condition, which consists of a stand-alone speech deepfake (bonafide vs spoof) detection task. Recently, large-scale self-supervised models have become a standard in Automatic Speech Recognition (ASR) and other speech processing tasks. Thus, we leverage a pre-trained WavLM as a front-end model and pool its representations with different back-end techniques. The complete framework is fine-tuned using only the training dataset of the challenge, similar to the closed condition. In addition, we apply data augmentation by adding noise and reverberation using the MUSAN noise and RIR datasets. We also experiment with codec augmentations to improve the performance of our method. Finally, we use the Bosaris toolkit for score calibration and system fusion to obtain better Cllr scores. Our fused…
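
The abstract mentions calibrating scores to improve Cllr. As a rough illustration, Cllr (the log-likelihood-ratio cost) can be written as the average of two logarithmic penalties, one over bonafide trials and one over spoof trials, and calibration is commonly an affine map of the raw scores. The sketch below uses only the Python standard library and does not reproduce the Bosaris toolkit; the names `cllr` and `affine_calibrate`, the scale and offset values, and the toy scores are assumptions for illustration. In practice the affine parameters would be fitted, for example by logistic regression, rather than set by hand.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def cllr(bonafide_llrs, spoof_llrs):
    """Log-likelihood-ratio cost in bits.

    Assumption: scores are natural-log likelihood ratios where positive
    values favour the bonafide hypothesis. A system that always outputs
    0 (no information) scores exactly 1.0; lower is better.
    """
    # Penalty for bonafide trials that receive low LLRs.
    target_cost = sum(softplus(-s) for s in bonafide_llrs) / len(bonafide_llrs)
    # Penalty for spoof trials that receive high LLRs.
    nontarget_cost = sum(softplus(s) for s in spoof_llrs) / len(spoof_llrs)
    return 0.5 * (target_cost + nontarget_cost) / math.log(2)


def affine_calibrate(scores, scale, offset):
    """Hypothetical calibration step: map raw scores to LLR-like values."""
    return [scale * s + offset for s in scores]


# Toy raw scores (hypothetical): well separated but on an arbitrary scale.
raw_bonafide = [0.9, 0.8, 0.7]
raw_spoof = [0.3, 0.2, 0.1]

before = cllr(raw_bonafide, raw_spoof)
# Hand-picked scale and offset; a real system would learn these.
after = cllr(affine_calibrate(raw_bonafide, 20.0, -10.0),
             affine_calibrate(raw_spoof, 20.0, -10.0))
print(f"Cllr before calibration: {before:.3f}, after: {after:.3f}")
```

Because the affine map preserves score ordering, discrimination metrics such as EER are unchanged by this step; only calibration-sensitive metrics such as Cllr move.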

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing