Deepfake Detection of Singing Voices With Whisper Encodings

Falguni Sharma; Priyanka Gupta

arXiv:2501.18919·cs.SD·February 3, 2025

Open Access

TL;DR

This paper introduces a deepfake detection system for singing voices using Whisper model encodings, evaluating their effectiveness across different model sizes and classifiers to identify manipulated vocals.

Contribution

It explores the use of Whisper encodings as features for singing voice deepfake detection, motivated by the observation that the encodings are noise-variant and rich in non-speech information even though the Whisper model itself is noise-robust.

Findings

01

Whisper encodings can effectively detect singing voice deepfakes.

02

Performance varies with Whisper model size and classifier used.

03

The system achieves promising results in different testing conditions.

Abstract

The deepfake generation of singing vocals is a concerning issue for artists in the music industry. In this work, we propose a singing voice deepfake detection (SVDD) system that uses noise-variant encodings of OpenAI's Whisper model. Counter-intuitively, although the Whisper model is known to be noise-robust, its encodings are rich in non-speech information and are noise-variant. This leads us to evaluate Whisper encodings as feature representations for the SVDD task. The SVDD task is therefore performed on vocals and mixtures, and performance is evaluated in % EER over varying Whisper model sizes and two classifiers, CNN and ResNet34, under different testing conditions.
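
The abstract reports performance in % EER (equal error rate), the operating point at which the false acceptance rate and the false rejection rate are equal. The sketch below shows one simple way this metric can be computed from classifier scores. It is not the authors' evaluation code: the function name `equal_error_rate`, the convention that higher scores mean "more likely bona fide", and the example scores are all assumptions made for illustration.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Return the EER in percent.

    Assumed convention: a higher score means 'more likely bona fide'.
    A sample is accepted as bona fide when its score is >= the threshold.
    """
    if not bonafide_scores or not spoof_scores:
        raise ValueError("both score lists must be non-empty")

    # Candidate thresholds: every observed score, plus one value above
    # all scores so that the 'reject everything' point is also covered.
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    thresholds.append(thresholds[-1] + 1.0)

    best_gap, eer = None, None
    for t in thresholds:
        # False acceptance rate: deepfakes accepted as bona fide.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection rate: bona fide samples rejected.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return 100.0 * eer


# Hypothetical scores, for illustration only.
bonafide = [0.9, 0.6, 0.4, 0.8]
spoof = [0.1, 0.5, 0.3, 0.7]
print(f"EER = {equal_error_rate(bonafide, spoof):.1f}%")
```

With a finite score set the two error rates may never coincide exactly, so this sketch reports the average of the two at the threshold where they are closest. Interpolating along the error curves is a common alternative.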

Taxonomy

Topics: Music and Audio Processing · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing