Are Music Foundation Models Better at Singing Voice Deepfake Detection? Far-Better Fuse them with Speech Foundation Models

Orchid Chetia Phukan; Sarthak Jain; Swarup Ranjan Behera; Arun Balaji Buduru; Rajesh Sharma; S. R. Mahadeva Prasanna

arXiv:2409.14131·eess.AS·September 24, 2024

Open Access

TL;DR

This paper compares music foundation models (MFMs) and speech foundation models (SFMs) for singing voice deepfake detection (SVDD). It finds that SFMs pre-trained for speaker recognition perform best among the individual models, and it introduces FIONA, a fusion framework that reports state-of-the-art results.

Contribution

The paper presents what the authors describe as the first comprehensive comparison of MFMs and SFMs for SVDD, and it proposes FIONA, a novel fusion framework, to improve detection performance.

Findings

01. Speaker recognition SFMs outperform the other foundation models on SVDD.

02. Fusing foundation models via FIONA improves detection performance over the individual models.

03. FIONA achieves the lowest equal error rate (EER), 13.74%, surpassing previous methods.
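The EER reported above is the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how such a figure can be computed from detector scores. It is not the authors' evaluation code: the function name `equal_error_rate`, the label convention (1 = deepfake, 0 = bona fide), and the toy scores are all assumptions made for illustration.

```python
import numpy as np


def equal_error_rate(labels, scores):
    """Approximate the EER by sweeping a threshold over the observed scores.

    Assumed convention (not from the paper): label 1 = deepfake, 0 = bona fide,
    and a higher score means "more likely deepfake".
    Returns (eer, threshold) at the point where the two error rates are closest.
    """
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)

    # Candidate thresholds: every distinct score, plus +inf (flag nothing).
    thresholds = np.concatenate([np.unique(scores), [np.inf]])

    best_gap, best_eer, best_thr = None, None, None
    for thr in thresholds:
        flagged = scores >= thr
        far = np.mean(flagged[~labels])   # bona fide clips flagged as deepfake
        frr = np.mean(~flagged[labels])   # deepfake clips that were missed
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer, best_thr = gap, (far + frr) / 2, thr
    return best_eer, best_thr


# Toy example with made-up scores for four bona fide and four deepfake clips.
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]
eer, thr = equal_error_rate(toy_labels, toy_scores)
print(f"EER = {eer:.2%} at threshold {thr}")
```

With real systems the two error curves rarely cross exactly at an observed score, so evaluation toolkits usually interpolate between neighbouring thresholds; the sweep here simply takes the closest point.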

Abstract

In this study, for the first time, we extensively investigate whether music foundation models (MFMs) or speech foundation models (SFMs) work better for singing voice deepfake detection (SVDD), which has recently attracted attention in the research community. For this, we perform a comprehensive comparative study of state-of-the-art (SOTA) MFMs (MERT variants and music2vec) and SFMs (pre-trained for general speech representation learning as well as speaker recognition). We show that speaker recognition SFM representations perform the best among all the foundation models (FMs), and this performance can be attributed to their higher efficacy in capturing the pitch, tone, intensity, and similar characteristics present in singing voices. To this end, we also explore the fusion of FMs to exploit their complementary behavior for improved SVDD, and we propose a novel framework, FIONA, for this purpose.…
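To make the idea of fusing foundation model representations concrete, here is a minimal sketch of a plain concatenation baseline. It is explicitly not FIONA, whose design is not described in the text above. The embeddings are random synthetic stand-ins rather than outputs of any real model, and the dimensions (512 and 1024), the function names, and the logistic head are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so neither embedding dominates by magnitude."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def fuse_by_concatenation(speech_emb, music_emb):
    """Simple fusion baseline (not FIONA): normalize each view, then concatenate."""
    return np.concatenate([l2_normalize(speech_emb), l2_normalize(music_emb)], axis=-1)


def train_logistic_head(features, labels, lr=0.5, steps=500):
    """Fit a logistic regression head with plain batch gradient descent."""
    n, d = features.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(features @ w + b)))
        grad = probs - labels              # gradient of the log loss w.r.t. the logits
        w -= lr * (features.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


# Synthetic stand-ins for per-clip embeddings (placeholder dimensions).
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200).astype(float)        # assumed: 1 = deepfake
speech_emb = rng.normal(size=(200, 512)) + 0.5 * labels[:, None]
music_emb = rng.normal(size=(200, 1024)) + 0.5 * labels[:, None]

fused = fuse_by_concatenation(speech_emb, music_emb)       # shape (200, 1536)
w, b = train_logistic_head(fused, labels)
scores = fused @ w + b                                      # higher = more likely deepfake
print(fused.shape, scores.shape)
```

Concatenation treats the two representations as independent feature blocks and leaves any interaction between them to the classifier, which is why it typically serves only as a reference point for more elaborate fusion schemes.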

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Softmax · Attention Is All You Need