Are Music Foundation Models Better at Singing Voice Deepfake Detection? Far-Better Fuse them with Speech Foundation Models
Orchid Chetia Phukan, Sarthak Jain, Swarup Ranjan Behera, Arun Balaji Buduru, Rajesh Sharma, S. R. Mahadeva Prasanna

TL;DR
This paper compares music and speech foundation models for singing voice deepfake detection. It finds that speech models, especially those trained for speaker recognition, perform best, and it introduces FIONA, a fusion framework that achieves state-of-the-art results.
Contribution
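This is the first comprehensive comparison of music foundation models (MFMs) and speech foundation models (SFMs) for singing voice deepfake detection (SVDD). The paper also proposes a novel fusion framework, FIONA, to improve detection performance.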
Findings
Speaker recognition SFMs outperform other models in SVDD.
Fusion of FMs via FIONA improves detection accuracy.
FIONA achieves the lowest equal error rate (EER), 13.74%, surpassing previous methods.
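The EER quoted above is the operating point at which the false acceptance rate (deepfakes accepted as genuine) equals the false rejection rate (genuine clips rejected). The following is a minimal sketch of how such a number can be computed from detector scores. It is not the paper's evaluation code: the function name `equal_error_rate`, the convention that higher scores mean "more likely bona fide", and the label encoding (1 = bona fide, 0 = deepfake) are all assumptions made for illustration.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Approximate the EER from detector scores.

    Assumptions (not taken from the paper): a higher score means
    'more likely bona fide'; labels use 1 = bona fide, 0 = deepfake.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    bona = scores[labels == 1]
    fake = scores[labels == 0]

    # Candidate thresholds: every observed score, plus +inf (reject everything).
    thresholds = np.concatenate([np.unique(scores), [np.inf]])

    # False acceptance rate: deepfakes scoring at or above the threshold.
    far = np.array([(fake >= t).mean() for t in thresholds])
    # False rejection rate: bona fide clips scoring below the threshold.
    frr = np.array([(bona < t).mean() for t in thresholds])

    # Take the threshold where the two error rates are closest and average them.
    i = int(np.argmin(np.abs(far - frr)))
    return float((far[i] + frr[i]) / 2)


# Toy example: two bona fide clips and two deepfakes, one pair mis-ranked.
print(equal_error_rate([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))  # prints 0.5
```

With a finite set of scores the two error curves rarely cross exactly, so this sketch averages them at the closest threshold; evaluation toolkits often interpolate instead, which can give slightly different values.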
Abstract
In this study, for the first time, we extensively investigate whether music foundation models (MFMs) or speech foundation models (SFMs) work better for singing voice deepfake detection (SVDD), which has recently attracted attention in the research community. For this, we perform a comprehensive comparative study of state-of-the-art (SOTA) MFMs (MERT variants and music2vec) and SFMs (pre-trained for general speech representation learning as well as speaker recognition). We show that speaker recognition SFM representations perform the best amongst all the foundation models (FMs), and this performance can be attributed to their higher efficacy in capturing characteristics of singing voices such as pitch, tone, and intensity. To this end, we also explore the fusion of FMs to exploit their complementary behavior for improved SVDD, and we propose a novel framework, FIONA, for this purpose.…
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
Methods: Softmax · Attention Is All You Need
