Speech Foundation Model Ensembles for the Controlled Singing Voice Deepfake Detection (CtrSVDD) Challenge 2024

Anmol Guragain; Tianchi Liu; Zihan Pan; Hardik B. Sailor; Qiongqiong Wang

arXiv:2409.02302·eess.AS·October 22, 2024

TL;DR

This paper presents an ensemble of speech foundation models and a novel Squeeze-and-Excitation Aggregation (SEA) method for detecting deepfake singing voices, achieving a leading 1.79% pooled equal error rate (EER) on the CtrSVDD evaluation set.

Contribution

Introduces a novel Squeeze-and-Excitation Aggregation (SEA) method and combines it with ensembles of speech foundation models for robust singing voice deepfake detection.

Findings

01

Achieved 1.79% pooled EER on CtrSVDD evaluation set

02

SEA method outperforms the authors' other individual systems

03

Ensemble approach enhances detection robustness
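
The headline figure in the findings is an equal error rate (EER): the operating point at which the false acceptance rate on spoofed clips equals the false rejection rate on bona fide clips. The following is a minimal sketch of that calculation, not the challenge's official scoring script. It assumes higher scores mean "more likely bona fide", and that "pooled" means the scores for all spoofing attack types are placed in a single trial list. The function name `compute_eer` and the sample scores are illustrative only.

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold). Assumes higher score = more likely bona fide."""
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)

    # Every observed score is a candidate decision threshold.
    thresholds = np.unique(np.concatenate([bonafide, spoof]))

    # False rejection rate: bona fide clips scored below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # False acceptance rate: spoofed clips scored at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])

    # The EER is taken where the two error rates are closest to equal.
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]

# Illustrative scores only; these are not results from the paper.
eer, threshold = compute_eer([0.9, 0.8, 0.7, 0.4], [0.1, 0.2, 0.3, 0.6])
print(f"EER = {eer:.2%} at threshold {threshold}")
```

With finite score lists the two error curves may not cross exactly, so this sketch averages the two rates at the closest threshold; production scoring tools often interpolate instead.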

Abstract

This work details our approach to achieving a leading system with a 1.79% pooled equal error rate (EER) on the evaluation set of the Controlled Singing Voice Deepfake Detection (CtrSVDD). The rapid advancement of generative AI models presents significant challenges for detecting AI-generated deepfake singing voices, attracting increased research attention. The Singing Voice Deepfake Detection (SVDD) Challenge 2024 aims to address this complex task. In this work, we explore ensemble methods, using speech foundation models to develop robust singing voice anti-spoofing systems. We also introduce a novel Squeeze-and-Excitation Aggregation (SEA) method, which efficiently and effectively integrates representation features from the speech foundation models, surpassing the performance of our other individual systems. Evaluation results confirm the efficacy of our approach in detecting…
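
The abstract describes SEA only as a way to integrate representation features from speech foundation models. As a rough illustration of what a squeeze-and-excitation style aggregation over model layers could look like, the sketch below pools each layer's hidden states to one scalar ("squeeze"), passes the per-layer scalars through a small two-layer gate ("excitation"), and uses the resulting weights to sum the layers. This is an assumption about the general idea, not the authors' implementation: the function `sea_aggregate`, the tensor shapes, the bottleneck size, and the random weights are all hypothetical.

```python
import numpy as np

def sea_aggregate(hidden_states, w1, b1, w2, b2):
    """Hypothetical SE-style fusion of layer outputs.

    hidden_states: (L, T, D) array, one (T, D) feature map per layer.
    w1, b1: bottleneck projection, shapes (R, L) and (R,).
    w2, b2: expansion back to one gate per layer, shapes (L, R) and (L,).
    Returns the fused (T, D) features and the (L,) layer gates.
    """
    # Squeeze: summarize each layer by its mean over time and feature dims.
    squeezed = hidden_states.mean(axis=(1, 2))               # (L,)
    # Excitation: bottleneck with ReLU, then a sigmoid gate per layer.
    bottleneck = np.maximum(0.0, w1 @ squeezed + b1)         # (R,)
    gates = 1.0 / (1.0 + np.exp(-(w2 @ bottleneck + b2)))    # (L,)
    # Aggregate: gate-weighted sum across the layer axis.
    fused = np.tensordot(gates, hidden_states, axes=(0, 0))  # (T, D)
    return fused, gates

# Illustrative shapes: 4 layers, 5 frames, 3 feature dims, bottleneck of 2.
rng = np.random.default_rng(0)
layers = rng.normal(size=(4, 5, 3))
fused, gates = sea_aggregate(
    layers,
    rng.normal(size=(2, 4)), np.zeros(2),
    rng.normal(size=(4, 2)), np.zeros(4),
)
print(fused.shape, gates.shape)
```

In a trained system the gate parameters would be learned jointly with the downstream classifier; here they are random so that the data flow can be inspected in isolation.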

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Squeeze-and-Excitation Aggregation (SEA)