ESDD2: Environment-Aware Speech and Sound Deepfake Detection Challenge Evaluation Plan

Xueping Zhang; Han Yin; Yang Xiao; Lin Zhang; Ting Dang; Rohan Kumar Das; Ming Li

arXiv:2601.07303 · cs.SD · February 6, 2026

TL;DR

This paper introduces ESDD2, a challenge on detecting component-level audio deepfakes, where either the foreground speech or the background environmental sound of a recording is manipulated independently. The challenge is built on the large-scale CompSpoofV2 dataset and a separation-enhanced joint learning framework.

Contribution

It presents the CompSpoofV2 dataset and a separation-enhanced joint learning framework for component-level audio deepfake detection. These address a gap in current methods: systems designed for fully deepfaked audio can be misled by the component that remains unaltered.

Findings

1. CompSpoofV2 contains over 250,000 samples for training and evaluation.

2. The joint learning framework improves detection accuracy in component-level deepfake scenarios.

3. The challenge promotes research on more realistic and robust audio deepfake detection methods.

Abstract

Audio recorded in real-world environments often contains a mixture of foreground speech and background environmental sounds. With rapid advances in text-to-speech, voice conversion, and other generation models, either component can now be modified independently. Such component-level manipulations are harder to detect, because the remaining unaltered component can mislead systems designed for whole deepfake audio, and they often sound more natural to human listeners. To address this gap, we propose the CompSpoofV2 dataset and a separation-enhanced joint learning framework. CompSpoofV2 is a large-scale curated dataset designed for component-level audio anti-spoofing; it contains over 250k audio samples with a total duration of approximately 283 hours. Based on CompSpoofV2 and the separation-enhanced joint learning framework, we launch the Environment-Aware Speech and Sound…

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis