WavLM model ensemble for audio deepfake detection

David Combei; Adriana Stan; Dan Oneata; Horia Cucu

arXiv:2408.07414 · eess.AS · August 15, 2024


TL;DR

This paper tackles audio deepfake detection as posed in the ASVspoof5 challenge by benchmarking ten pretrained representations, fine-tuning wavLM, and fusing four models into a single challenge submission.

Contribution

It fine-tunes wavLM for deepfake detection on an extended, augmented version of the ASVspoof5 data and submits a late-fusion ensemble of four models to the challenge.

Findings

01

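Among ten pretrained representations, the wav2vec2 and wavLM families perform best; wavLM is the better of the two when pretraining data is restricted to LibriSpeech.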

02

Fine-tuning wavLM enhances detection accuracy.

03

A late-fusion ensemble of four models achieves equal error rates (EER) of 6.56% and 17.08%.
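
For readers unfamiliar with the metric, the equal error rate is the operating point at which the false acceptance rate (spoofed audio accepted as genuine) equals the false rejection rate (genuine audio rejected). The sketch below is a minimal illustration and not the challenge's official scoring code. It assumes higher scores mean "more likely bona fide", and the function name and toy scores are hypothetical.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    Assumption: a higher score means the utterance is more likely bona fide.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection rate: bona fide utterances scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoofed utterances scored at or above it.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (hypothetical): one bona fide and one spoofed utterance overlap.
bonafide = [0.9, 0.8, 0.7, 0.4]
spoof = [0.1, 0.2, 0.3, 0.6]
print(equal_error_rate(bonafide, spoof))  # 0.25 for these toy scores
```

With only a handful of scores the sweep lands exactly on a crossing point; on real evaluation sets the two error curves rarely meet exactly, which is why the sketch averages the two rates at the closest threshold.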

Abstract

Audio deepfake detection has become a pivotal task over the last couple of years, as many recent speech synthesis and voice cloning systems generate highly realistic speech samples, thus enabling their use in malicious activities. In this paper we address the issue of audio deepfake detection as it was set in the ASVspoof5 challenge. First, we benchmark ten types of pretrained representations and show that the self-supervised representations stemming from the wav2vec2 and wavLM families perform best. Of the two, wavLM is better when restricting the pretraining data to LibriSpeech, as required by the challenge rules. To further improve performance, we finetune the wavLM model for the deepfake detection task. We extend the ASVspoof5 dataset with samples from other deepfake detection datasets and apply data augmentation. Our final challenge submission consists of a late fusion combination…
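
Late fusion combines the output scores of separately trained models rather than their internal features. The sketch below illustrates one common variant, a weighted average of per-utterance scores. This is an assumption for illustration only: the abstract excerpt above does not state how the four models are fused, and the function name, weights, and scores are hypothetical.

```python
def late_fusion(per_model_scores, weights=None):
    """Fuse per-utterance scores from several models by weighted averaging.

    per_model_scores: one list of scores per model, all of the same length
                      and in the same utterance order.
    weights:          optional per-model weights; defaults to equal weights.
    """
    n_models = len(per_model_scores)
    if n_models == 0:
        raise ValueError("need scores from at least one model")
    n_utts = len(per_model_scores[0])
    if any(len(scores) != n_utts for scores in per_model_scores):
        raise ValueError("all models must score the same utterances")
    if weights is None:
        weights = [1.0] * n_models
    if len(weights) != n_models:
        raise ValueError("need exactly one weight per model")

    total = sum(weights)
    # Weighted mean of the model scores, computed utterance by utterance.
    return [
        sum(w * scores[i] for w, scores in zip(weights, per_model_scores)) / total
        for i in range(n_utts)
    ]


# Four hypothetical models scoring the same two utterances.
scores = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
fused = late_fusion(scores)
```

Averaging assumes the models produce scores on comparable scales; if they do not, the scores would need to be normalized or calibrated before fusing.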

Code & Models

Models

DavidCombei/wavLM-base-Deepfake_V2 (Hugging Face model, 463 downloads)

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Sparse Evolutionary Training