BUT Systems for Environmental Sound Deepfake Detection in the ESDD 2026 Challenge
Junyi Peng, Lin Zhang, Jin Li, Oldrich Plchot, Jan Cernocky

TL;DR
This paper presents an ensemble framework that combines diverse SSL models with feature-domain augmentation for environmental sound deepfake detection on unseen generators (ESDD 2026 Challenge, Track 1), reporting 0.00% EER on the development set and, with system fusion, 3.52% EER on the progress set.
Contribution
It introduces a novel ensemble approach with SSL models and feature augmentation to improve generalization to unseen audio synthesis methods.
Findings
Achieved EER of 0.00% on development set
Fusion system reduced EER to 3.52% on progress set
Effective robustness against unseen spectral distortions
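The findings above are reported as equal error rate (EER), the operating point at which the false acceptance rate and the false rejection rate coincide. As a minimal sketch of how such a number can be computed from detection scores: the function name `compute_eer` and the convention that a higher score means "fake" (label 1) are assumptions made for illustration, not details taken from the paper or from the challenge's scoring tool.

```python
import numpy as np

def compute_eer(scores, labels):
    """Approximate the equal error rate (EER) from raw detection scores.

    Assumption for this sketch: higher score means 'fake' (label 1),
    lower score means 'real' (label 0).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)

    pos = scores[labels == 1]  # scores of fake clips
    neg = scores[labels == 0]  # scores of real clips

    # Candidate thresholds: every distinct score, plus one above the maximum
    # so that the "reject everything" operating point is also covered.
    thresholds = np.unique(scores)
    thresholds = np.concatenate([thresholds, [thresholds[-1] + 1.0]])

    # A clip is flagged as fake when score >= threshold.
    far = np.array([(neg >= t).mean() for t in thresholds])  # real flagged as fake
    frr = np.array([(pos < t).mean() for t in thresholds])   # fake missed

    # EER is taken at the threshold where the two error rates are closest.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0

# Perfectly separated scores give an EER of 0; one swapped pair gives 0.5.
eer_clean = compute_eer([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
eer_mixed = compute_eer([0.1, 0.6, 0.4, 0.9], [0, 0, 1, 1])
```

An EER of 0.00% therefore means that some threshold separates real and fake clips in the evaluated set without error; it says nothing about how well that threshold transfers to a different set.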
Abstract
This paper describes the BUT submission to the ESDD 2026 Challenge, specifically focusing on Track 1: Environmental Sound Deepfake Detection with Unseen Generators. To address the critical challenge of generalizing to audio generated by unseen synthesis algorithms, we propose a robust ensemble framework leveraging diverse Self-Supervised Learning (SSL) models. We conduct a comprehensive analysis of general audio SSL models (including BEATs, EAT, and Dasheng) and speech-specific SSLs. These front-ends are coupled with a lightweight Multi-Head Factorized Attention (MHFA) back-end to capture discriminative representations. Furthermore, we introduce a feature domain augmentation strategy based on distribution uncertainty modeling to enhance model robustness against unseen spectral distortions. All models are trained exclusively on the official EnvSDD data, without using any external…
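The abstract does not spell out the augmentation, so the following is only a sketch of one common way to model distribution uncertainty in feature space: treat each utterance's per-channel mean and standard deviation as random variables, estimate their spread across the batch, and resample them before re-normalizing the features. The function name `uncertainty_augment`, the `(batch, time, dim)` layout, and the Gaussian resampling are all assumptions for illustration and may differ from what the authors implemented.

```python
import numpy as np

def uncertainty_augment(x, rng, eps=1e-6):
    """Perturb per-utterance feature statistics (hypothetical sketch).

    x:   array of shape (batch, time, dim), e.g. SSL front-end features.
    rng: a numpy.random.Generator used for sampling.
    """
    # Per-utterance statistics over the time axis.
    mu = x.mean(axis=1, keepdims=True)                    # (B, 1, D)
    sigma = np.sqrt(x.var(axis=1, keepdims=True) + eps)   # (B, 1, D)

    # Spread of those statistics across the batch: the "uncertainty".
    mu_spread = np.sqrt(mu.var(axis=0, keepdims=True) + eps)        # (1, 1, D)
    sigma_spread = np.sqrt(sigma.var(axis=0, keepdims=True) + eps)  # (1, 1, D)

    # Sample new statistics around the observed ones.
    new_mu = mu + rng.standard_normal(mu.shape) * mu_spread
    new_sigma = sigma + rng.standard_normal(sigma.shape) * sigma_spread

    # Normalize with the original statistics, re-scale with the sampled ones.
    return new_sigma * (x - mu) / sigma + new_mu

rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 50, 8)) * np.arange(1, 5).reshape(4, 1, 1)
augmented = uncertainty_augment(feats, rng)
```

In a training loop, a transform of this kind would typically be applied only at training time and with some probability per batch, leaving inference features untouched; those choices are likewise assumptions here.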
Taxonomy
Topics: Music and Audio Processing · Machine Learning and Data Classification · Generative Adversarial Networks and Image Synthesis
