Technical Report of Nomi Team in the Environmental Sound Deepfake Detection Challenge 2026
Candy Olivia Mawalim, Haotian Zhang, Shogo Okada

TL;DR
This paper describes the Nomi Team's approach to the ICASSP 2026 Environmental Sound Deepfake Detection (ESDD) Challenge, which uses an audio-text cross-attention model to detect synthetic environmental sounds.
Contribution
The paper introduces an audio-text cross-attention model for environmental sound deepfake detection, targeting unseen generators and low-resource black-box scenarios.
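For readers unfamiliar with the term, cross-attention between two modalities can be sketched as follows. This is a minimal single-head illustration in NumPy, not the team's architecture: the choice of audio frames as queries and text tokens as keys and values, the feature dimensions, the random projection matrices, and the function name `cross_attention` are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(audio, text, w_q, w_k, w_v):
    """Single-head cross-attention (hypothetical helper).

    audio: (T_audio, d_audio) frame embeddings, used as queries (assumption).
    text:  (T_text, d_text) token embeddings, used as keys/values (assumption).
    Returns the text-informed audio features and the attention weights.
    """
    q = audio @ w_q                      # (T_audio, d)
    k = text @ w_k                       # (T_text, d)
    v = text @ w_v                       # (T_text, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)   # each audio frame attends over text tokens
    return weights @ v, weights

# Toy inputs with made-up sizes: 50 audio frames, 12 text tokens.
rng = np.random.default_rng(0)
audio = rng.normal(size=(50, 768))
text = rng.normal(size=(12, 512))
d = 64
w_q = rng.normal(size=(768, d)) * 0.02
w_k = rng.normal(size=(512, d)) * 0.02
w_v = rng.normal(size=(512, d)) * 0.02

fused, weights = cross_attention(audio, text, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

In a trained model the projection matrices would be learned, and the fused features would typically be pooled and passed to a classifier that outputs a real-versus-fake score.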
Findings
Achieved competitive equal error rate (EER) improvements over the challenge baseline (BEATs+AASIST)
Demonstrated the effectiveness of combining text and audio models
Addressed the challenge of unseen sound generators
Abstract
This paper presents our work for the ICASSP 2026 Environmental Sound Deepfake Detection (ESDD) Challenge. The challenge is based on the large-scale EnvSDD dataset that consists of various synthetic environmental sounds. We focus on addressing the complexities of unseen generators and low-resource black-box scenarios by proposing an audio-text cross-attention model. Experiments with individual and combined text-audio models demonstrate competitive EER improvements over the challenge baseline (BEATs+AASIST model).
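The EER reported above is the operating point at which the false positive rate equals the false negative rate. The sketch below shows one common way to estimate it from detector scores by sweeping thresholds. It is an illustration only, not the challenge's official scoring script: the label convention (1 = fake, 0 = real), the assumption that higher scores mean "more likely fake", and the function name `compute_eer` are choices made for this example.

```python
import numpy as np

def compute_eer(scores, labels):
    """Estimate the equal error rate by a threshold sweep (hypothetical helper).

    scores: detector outputs, higher = more likely fake (assumption).
    labels: 1 for fake, 0 for real (assumption).
    Returns (eer, threshold at which it is reached).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = scores[labels == 1]
    neg = scores[labels == 0]

    # Candidate thresholds: every observed score, plus one above the maximum
    # so that the "reject everything" operating point is included.
    thresholds = np.unique(scores)
    thresholds = np.append(thresholds, thresholds[-1] + 1.0)

    # A sample is predicted fake when score >= threshold.
    fpr = np.array([(neg >= t).mean() for t in thresholds])  # real flagged as fake
    fnr = np.array([(pos < t).mean() for t in thresholds])   # fake passed as real

    # Take the threshold where the two error rates are closest and average them.
    idx = np.argmin(np.abs(fpr - fnr))
    return (fpr[idx] + fnr[idx]) / 2, thresholds[idx]

# Perfectly separable toy scores give an EER of 0.
eer_clean, _ = compute_eer([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
# One real and one fake sample swapped in rank give an EER of 0.5.
eer_mixed, _ = compute_eer([0.1, 0.6, 0.4, 0.9], [0, 0, 1, 1])
print(eer_clean, eer_mixed)
```

A lower EER indicates better separation between real and synthetic sounds; on small evaluation sets the estimate is coarse because the error rates change in discrete steps.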
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Voice and Speech Disorders
