Environmental Sound Deepfake Detection Using Deep-Learning Framework

Lam Pham; Khoi Vu; Dat Tran; Phat Lam; Vu Nguyen; David Fischinger; Son Le

arXiv:2604.19652 · cs.SD · May 4, 2026

TL;DR

This paper introduces a deep-learning framework for detecting deepfake environmental sounds, emphasizing the importance of task-specific detection and the effectiveness of fine-tuning pre-trained models.

Contribution

The paper demonstrates that treating sound-scene and sound-event deepfake detection as separate tasks improves accuracy, and that fine-tuning pre-trained models such as WavLM is more effective than training from scratch.

Findings

01

Achieved an accuracy of 0.98 on the EnvSDD dataset

02

Achieved an F1 score of 0.95 on the EnvSDD dataset

03

Fine-tuning pre-trained models outperforms training from scratch
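
To make the two reported metrics concrete, here is a minimal sketch of how accuracy and F1 can be computed for a binary real/fake detector. It assumes that label 1 means "fake" and is treated as the positive class; the summary above does not say which class the paper treats as positive, and the function name `accuracy_and_f1` is hypothetical.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels.

    Assumption: `positive` marks the fake class (1 = fake, 0 = real).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there are no positives at all.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Toy labels for illustration only; these are not results from the paper.
acc, f1 = accuracy_and_f1([1, 1, 0, 0, 1], [1, 0, 0, 0, 1])
```

Accuracy counts every correct decision equally, whereas F1 ignores true negatives, so the two can diverge when real and fake clips are imbalanced.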

Abstract

In this paper, we propose a deep-learning framework for environmental sound deepfake detection (ESDD), the task of identifying whether the sound scene and sound event in an input audio recording are fake. To this end, we conducted extensive experiments to explore how individual spectrograms, a wide range of network architectures and pre-trained models, and ensembles of spectrograms or network architectures affect ESDD performance. The experimental results on the benchmark datasets EnvSDD and ESDD-Challenge-TestSet indicate that detecting deepfake audio of sound scenes and detecting deepfake audio of sound events should be treated as separate tasks. We also show that fine-tuning a pre-trained model is more effective than training a model from scratch for the ESDD task. Finally, our best model, which was fine-tuned from the pre-trained WavLM…
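
The abstract mentions ensembles of spectrograms or network architectures without stating how the individual outputs are combined. The sketch below illustrates one common option, late fusion by a (weighted) mean of per-model fake probabilities; this is an assumption for illustration, not the authors' confirmed method. The names `fuse_scores` and `decide` and the 0.5 threshold are hypothetical.

```python
def fuse_scores(model_scores, weights=None):
    """Fuse per-model fake probabilities by a weighted mean (late fusion).

    model_scores: list of lists; model_scores[m][i] is the probability that
    clip i is fake according to model m. Mean fusion is an assumed choice.
    """
    if not model_scores:
        raise ValueError("at least one model is required")
    n_clips = len(model_scores[0])
    if any(len(scores) != n_clips for scores in model_scores):
        raise ValueError("all models must score the same number of clips")
    if weights is None:
        weights = [1.0] * len(model_scores)  # equal weighting by default
    if len(weights) != len(model_scores):
        raise ValueError("one weight per model is required")

    total = sum(weights)
    return [
        sum(w * scores[i] for w, scores in zip(weights, model_scores)) / total
        for i in range(n_clips)
    ]


def decide(fused, threshold=0.5):
    """Map fused probabilities to labels: 1 = fake, 0 = real (assumed convention)."""
    return [1 if p >= threshold else 0 for p in fused]


# Two hypothetical models (e.g. trained on different spectrograms) scoring two clips.
fused = fuse_scores([[0.9, 0.2], [0.7, 0.4]])
labels = decide(fused)
```

Averaging probabilities keeps each model's confidence in play, which majority voting on hard labels would discard.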
