WavLM model ensemble for audio deepfake detection
David Combei, Adriana Stan, Dan Oneata, Horia Cucu

TL;DR
This paper improves audio deepfake detection by benchmarking pretrained representations, fine-tuning WavLM, and combining multiple models in an ensemble, achieving state-of-the-art results in the ASVspoof5 challenge.
Contribution
It introduces a model ensemble approach that combines fine-tuned WavLM models with data augmentation, setting new performance benchmarks for audio deepfake detection.
Findings
WavLM representations outperform other pretrained models.
Fine-tuning WavLM enhances detection accuracy.
An ensemble of four models achieves equal error rates (EER) of 6.56% and 17.08%.
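The equal error rate reported above is the operating point at which the false acceptance rate (spoofed audio accepted as genuine) equals the false rejection rate (genuine audio rejected). The following is a minimal sketch of how such a value can be approximated from detector scores; it is not the challenge's official scoring tool. The function name `equal_error_rate` and the convention that a higher score means "more likely bona fide" are assumptions made for illustration.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping thresholds over all observed scores.

    Assumption: a higher score means the sample is more likely bona fide.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False acceptance: spoofed samples scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection: bona fide samples scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (made up for illustration, not taken from the paper).
toy_bonafide = [0.9, 0.8, 0.7, 0.4]
toy_spoof = [0.1, 0.2, 0.3, 0.6]
print(equal_error_rate(toy_bonafide, toy_spoof))
```

With a finite set of scores the two error rates rarely cross exactly, so this sketch returns their average at the closest threshold; production scoring tools typically interpolate along the error curve instead.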
Abstract
Audio deepfake detection has become a pivotal task over the last couple of years, as many recent speech synthesis and voice cloning systems generate highly realistic speech samples, thus enabling their use in malicious activities. In this paper we address the issue of audio deepfake detection as it was set in the ASVspoof5 challenge. First, we benchmark ten types of pretrained representations and show that the self-supervised representations stemming from the wav2vec2 and WavLM families perform best. Of the two, WavLM is better when restricting the pretraining data to LibriSpeech, as required by the challenge rules. To further improve performance, we fine-tune the WavLM model for the deepfake detection task. We extend the ASVspoof5 dataset with samples from other deepfake detection datasets and apply data augmentation. Our final challenge submission consists of a late fusion combination…
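Late fusion, as mentioned in the abstract, combines the output scores of separately trained models instead of merging their internal features. The truncated abstract does not state the fusion rule, so the sketch below assumes a plain (optionally weighted) average of per-utterance scores; the function name `late_fusion`, the `weights` parameter, and the toy scores are all hypothetical.

```python
def late_fusion(per_model_scores, weights=None):
    """Fuse utterance-level scores from several models by weighted averaging.

    per_model_scores: one list of scores per model, aligned by utterance.
    weights: optional per-model weights; defaults to a uniform average
             (an assumption, not the rule confirmed by the paper).
    """
    n_models = len(per_model_scores)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    if len(weights) != n_models:
        raise ValueError("need exactly one weight per model")
    n_utts = len(per_model_scores[0])
    if any(len(scores) != n_utts for scores in per_model_scores):
        raise ValueError("all models must score the same utterances")
    # For each utterance, take the weighted sum of the models' scores.
    return [
        sum(w * scores[i] for w, scores in zip(weights, per_model_scores))
        for i in range(n_utts)
    ]


# Four hypothetical models scoring the same two utterances.
toy_scores = [
    [0.0, 1.0],
    [1.0, 1.0],
    [0.0, 1.0],
    [1.0, 1.0],
]
print(late_fusion(toy_scores))
```

Averaging scores assumes the models produce outputs on comparable scales; if they do not, the scores would usually be normalized or calibrated per model before fusion.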
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
Methods: Sparse Evolutionary Training
