Deepfake Audio Detection Using Spectrogram-based Feature and Ensemble of Deep Learning Models

Lam Pham; Phat Lam; Truong Nguyen; Huyen Nguyen; Alexander Schindler

arXiv:2407.01777 · cs.SD · July 3, 2024 · 3 cites

TL;DR

This paper introduces a spectrogram-based deep learning ensemble system for detecting deepfake audio; the best ensemble reaches an Equal Error Rate (EER) of 0.03 on the ASVspoof 2019 benchmark.
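For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate (spoofed audio accepted as genuine) equals the false rejection rate (genuine audio rejected). The following is a minimal sketch of how it could be computed from detector scores. The function name `equal_error_rate`, the score convention (higher means more likely bona fide), and the toy scores are assumptions made for illustration; they are not taken from the paper.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by scanning every observed score as a threshold.

    Assumed convention (not from the paper): a higher score means the
    detector considers the utterance more likely to be bona fide.
    Returns (eer, threshold) at the point where FAR and FRR are closest.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection: bona fide audio scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed audio scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores, invented for this example only.
bonafide = [0.9, 0.8, 0.7, 0.4]
spoof = [0.1, 0.2, 0.3, 0.6]
eer, threshold = equal_error_rate(bonafide, spoof)
print(eer, threshold)  # 0.25 0.6
```

With these toy scores, one of four bona fide utterances is rejected and one of four spoofed utterances is accepted at threshold 0.6, so both error rates equal 0.25.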

Contribution

It proposes a novel combination of spectrogram transformations, multiple deep learning models, and ensemble techniques for improved deepfake audio detection.

Findings

01. The best ensemble model achieved an EER of 0.03 on the ASVspoof 2019 dataset.

02. Spectrogram transformations and deep learning models significantly enhance detection accuracy.

03. The ensemble approach outperforms individual models in deepfake audio detection.
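One common way to build such an ensemble is late fusion, in which the per-utterance probabilities of several models are averaged before a decision is made. The sketch below assumes simple mean fusion; this summary does not state which fusion rule the paper uses, and the names `mean_fusion` and `decide`, the 0.5 threshold, and the three sets of model outputs are hypothetical.

```python
def mean_fusion(model_probs):
    """Average spoof probabilities across models, utterance by utterance.

    model_probs: one list of probabilities per model, all the same length.
    Mean fusion is an assumption here, not the paper's confirmed rule.
    """
    n_utts = len(model_probs[0])
    if any(len(probs) != n_utts for probs in model_probs):
        raise ValueError("every model must score the same utterances")
    n_models = len(model_probs)
    return [sum(probs[i] for probs in model_probs) / n_models
            for i in range(n_utts)]


def decide(probs, threshold=0.5):
    """Label an utterance as spoofed when its probability reaches the threshold."""
    return [p >= threshold for p in probs]


# Hypothetical spoof probabilities from three models on three utterances.
# Only the first utterance is actually spoofed in this toy setup.
model_a = [0.9, 0.6, 0.2]  # wrongly flags utterance 2
model_b = [0.8, 0.3, 0.1]
model_c = [0.7, 0.3, 0.6]  # wrongly flags utterance 3

fused = mean_fusion([model_a, model_b, model_c])
print(decide(fused))  # [True, False, False]
```

In this toy case two of the three models each make one mistake, but their errors fall on different utterances, so the averaged score recovers the correct label for all three. This is the intuition behind ensembles outperforming individual models.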

Abstract

In this paper, we propose a deep learning based system for the task of deepfake audio detection. In particular, the raw input audio is first transformed into various spectrograms using three transformation methods, Short-time Fourier Transform (STFT), Constant-Q Transform (CQT), and Wavelet Transform (WT), combined with different auditory-based filters: Mel, Gammatone, linear filters (LF), and discrete cosine transform (DCT). Given the spectrograms, we evaluate a wide range of classification models based on three deep learning approaches. The first approach is to train directly on the spectrograms using our proposed baseline models: a CNN-based model (CNN-baseline), an RNN-based model (RNN-baseline), and a C-RNN model (C-RNN baseline). Meanwhile, the second approach is transfer learning from computer vision models such as ResNet-18, MobileNet-V3, EfficientNet-B0, DenseNet-121, ShuffleNet-V2, Swin-T,…
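To make the front end concrete, the sketch below shows one of the listed combinations, an STFT followed by a Mel filter bank, using only the Python standard library. It is a didactic approximation under stated assumptions: the frame length, hop size, number of filters, Hann window, and all function names are illustrative choices, not the paper's settings, and a practical system would typically rely on an optimized library such as NumPy or librosa instead of a naive DFT.

```python
import cmath
import math


def stft_magnitude(signal, frame_len=64, hop=32):
    """Magnitude STFT via a naive DFT: one row per frame, frame_len//2 + 1 bins."""
    # Periodic Hann window to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        frames.append([
            abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n in range(frame_len)))
            for k in range(n_bins)
        ])
    return frames


def hz_to_mel(freq_hz):
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, frame_len, sample_rate):
    """Triangular filters whose edges are equally spaced on the Mel scale."""
    mel_max = hz_to_mel(sample_rate / 2)
    edges_hz = [mel_to_hz(mel_max * i / (n_filters + 1))
                for i in range(n_filters + 2)]
    bin_hz = [k * sample_rate / frame_len for k in range(frame_len // 2 + 1)]
    bank = []
    for m in range(1, n_filters + 1):
        lo, center, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        # Rising slope up to the center frequency, falling slope after it.
        bank.append([max(0.0, min((f - lo) / (center - lo),
                                  (hi - f) / (hi - center)))
                     for f in bin_hz])
    return bank


def mel_spectrogram(signal, sample_rate, frame_len=64, hop=32, n_filters=8):
    """Log Mel-filtered power spectrogram: one row per frame, one column per filter."""
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    return [[math.log(sum(w * x * x for w, x in zip(row, frame)) + 1e-10)
             for row in bank]
            for frame in stft_magnitude(signal, frame_len, hop)]


# A synthetic 1 kHz tone sampled at 8 kHz stands in for real audio.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(256)]
spec = mel_spectrogram(tone, sample_rate)
print(len(spec), len(spec[0]))  # 7 8
```

The resulting two-dimensional array (frames by filters) is the kind of image-like input that CNN, RNN, or pretrained computer vision models can consume. Swapping the STFT for a CQT or wavelet transform, or the Mel bank for Gammatone or linear filters, would yield the other spectrogram variants named in the abstract.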

Taxonomy

Topics: Music and Audio Processing

Methods: GoogLeNet · Convolution · 1x1 Convolution · Auxiliary Classifier · Average Pooling · Dropout · Dense Connections · Inception Module · Softmax