Technical Report of Nomi Team in the Environmental Sound Deepfake Detection Challenge 2026

Candy Olivia Mawalim; Haotian Zhang; Shogo Okada

arXiv:2512.06041 · cs.SD · December 9, 2025

Open Access

TL;DR

This paper describes the Nomi Team's approach to the 2026 Environmental Sound Deepfake Detection Challenge, using an audio-text cross-attention model to improve detection of synthetic environmental sounds.

Contribution

The paper introduces an audio-text cross-attention model for environmental sound deepfake detection, targeting unseen generators and low-resource black-box scenarios.
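To make the idea of audio-text cross-attention concrete, here is a minimal sketch of single-head scaled dot-product cross-attention in plain Python. It is an illustration only, not the team's architecture: the function names `softmax` and `cross_attention` are hypothetical, the choice of audio frames as queries and text tokens as keys and values is an assumption (the summary does not say which modality attends to which), and the learned projection matrices a real model would use are omitted.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(audio, text):
    """Single-head scaled dot-product cross-attention without learned projections.

    audio: list of audio-frame embeddings, used as queries (assumption).
    text:  list of text-token embeddings, used as keys and values (assumption).
    Returns one text-informed vector per audio frame.
    """
    d = len(audio[0])
    scale = 1.0 / math.sqrt(d)
    fused = []
    for q in audio:
        # Similarity of this audio frame to every text token.
        logits = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in text]
        weights = softmax(logits)
        # Weighted average of the text embeddings.
        fused.append([sum(w * v[j] for w, v in zip(weights, text)) for j in range(d)])
    return fused


# Toy embeddings, purely illustrative.
audio_frames = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[1.0, 0.0], [1.0, 0.0]]
fused = cross_attention(audio_frames, text_tokens)
```

Each output row is a convex combination of the text embeddings, so the output has one vector per audio frame with the same dimensionality as the text embeddings. In a trained model the fused vectors would typically feed a classifier head; that step is not shown here.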

Findings

1. Achieved competitive EER improvements over the challenge baseline (BEATs+AASIST)
2. Demonstrated the effectiveness of combined text-audio models
3. Addressed the challenge of unseen sound generators

Abstract

This paper presents our work for the ICASSP 2026 Environmental Sound Deepfake Detection (ESDD) Challenge. The challenge is based on the large-scale EnvSDD dataset that consists of various synthetic environmental sounds. We focus on addressing the complexities of unseen generators and low-resource black-box scenarios by proposing an audio-text cross-attention model. Experiments with individual and combined text-audio models demonstrate competitive EER improvements over the challenge baseline (BEATs+AASIST model).
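The abstract reports results in terms of EER (equal error rate), the operating point at which the false acceptance rate equals the false rejection rate. The sketch below shows one simple way to estimate it from detection scores by sweeping thresholds. It is not the challenge's official scoring script: the function name `compute_eer`, the label convention (1 for the positive class, 0 for the negative class), and the toy scores are assumptions made for illustration.

```python
def compute_eer(scores, labels):
    """Estimate the equal error rate by a threshold sweep.

    scores: higher means more likely to belong to the positive class.
    labels: 1 for positive, 0 for negative (assumed convention).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")

    # Candidate thresholds: every observed score, plus one above the maximum.
    thresholds = sorted(set(scores))
    thresholds.append(thresholds[-1] + 1.0)

    best_gap, eer = None, None
    for t in thresholds:
        # Predict positive when score >= t.
        far = sum(s >= t for s in negatives) / len(negatives)  # false acceptance
        frr = sum(s < t for s in positives) / len(positives)   # false rejection
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Report the midpoint at the threshold where the two rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy examples, purely illustrative.
eer_separable = compute_eer([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
eer_overlap = compute_eer([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
```

With finitely many scores the two error rates may never be exactly equal, so this version reports their average at the closest threshold; evaluation toolkits often interpolate along the ROC curve instead, which can give slightly different values.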

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Voice and Speech Disorders