Environmental Sound Deepfake Detection Using Deep-Learning Framework
Lam Pham, Khoi Vu, Dat Tran, Phat Lam, Vu Nguyen, David Fischinger, Son Le

TL;DR
This paper introduces a deep-learning framework for detecting deepfake environmental sounds, emphasizing the importance of task-specific detection and the effectiveness of fine-tuning pre-trained models.
Contribution
It shows that treating sound scene and sound event deepfake detection as separate tasks improves accuracy, and that fine-tuning a pre-trained model such as WavLM outperforms training a model from scratch.
Findings
Accuracy of 0.98 on the EnvSDD dataset
F1 score of 0.95 on the EnvSDD dataset
Fine-tuning pre-trained models outperforms training from scratch
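The accuracy and F1 figures above are standard binary-classification metrics. As a minimal sketch of how such scores are typically computed, assuming label 1 means "fake" and is treated as the positive class (an assumption; the summary does not state which class is positive), and using a hypothetical helper name `accuracy_and_f1`:

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels.

    Assumption: `positive` marks the "fake" class. This is an
    illustrative helper, not code from the paper.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Confusion-matrix counts for the positive class.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0

    # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


if __name__ == "__main__":
    # Toy labels for illustration only; these are not results from the paper.
    acc, f1 = accuracy_and_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
    print(f"accuracy={acc:.2f}, f1={f1:.2f}")
```

Accuracy and F1 can diverge when the real and fake classes are imbalanced, which is one reason both are commonly reported together.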
Abstract
In this paper, we propose a deep-learning framework for environmental sound deepfake detection (ESDD) -- the task of identifying whether the sound scene and sound event in an input audio recording are fake or not. To this end, we conducted extensive experiments to explore how individual spectrograms, a wide range of network architectures and pre-trained models, and ensembles of spectrograms or network architectures affect performance on the ESDD task. The experimental results on the benchmark datasets EnvSDD and ESDD-Challenge-TestSet indicate that detecting deepfake audio of sound scenes and detecting deepfake audio of sound events should be treated as individual tasks. We also show that fine-tuning a pre-trained model is more effective than training a model from scratch for the ESDD task. Finally, our best model, which was fine-tuned from the pre-trained WavLM…
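The abstract mentions ensembles of spectrograms or network architectures without saying, in this excerpt, how the individual outputs are combined. One common option is score-level (late) fusion, where each model's per-clip "fake" probability is averaged before thresholding. The sketch below illustrates that idea only; mean fusion, the 0.5 threshold, and the names `fuse_mean` and `decide` are assumptions, not the authors' stated method.

```python
def fuse_mean(prob_lists):
    """Average per-clip fake probabilities across several models.

    prob_lists: one list of probabilities per model, all the same length.
    Mean fusion is an assumed strategy for illustration.
    """
    if not prob_lists:
        raise ValueError("need at least one model's probabilities")
    n_clips = len(prob_lists[0])
    if any(len(p) != n_clips for p in prob_lists):
        raise ValueError("every model must score the same number of clips")
    # zip(*prob_lists) groups the scores that belong to the same clip.
    return [sum(scores) / len(prob_lists) for scores in zip(*prob_lists)]


def decide(probs, threshold=0.5):
    """Label a clip 1 (fake) when its fused probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probs]


if __name__ == "__main__":
    # Hypothetical scores from two models on three clips.
    model_a = [0.9, 0.2, 0.6]
    model_b = [0.7, 0.4, 0.3]
    fused = fuse_mean([model_a, model_b])
    print(decide(fused))
```

Averaging probabilities keeps each model's confidence in play, whereas majority voting on hard labels would discard it; which works better here is an empirical question that this excerpt does not settle.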
