Noise-Robust Sound Event Detection and Counting via Language-Queried Sound Separation

Yuanjian Chen, Yang Xiao, Han Yin, Yadong Guan, and Xubo Liu

arXiv:2508.07176 · cs.SD · August 12, 2025

TL;DR

This paper introduces a multi-task learning framework that co-trains event appearance detection (EAD) and sound event detection (SED) to improve SED robustness in noisy environments when used with language-queried audio source separation (LASS).

Contribution

It proposes a co-training-based multi-task framework that pairs SED with event counting (EAD) and adds a task-based constraint to keep the two tasks' predictions consistent, improving SED performance under noisy conditions.

Findings

01. Outperforms existing methods on DESED and WildDESED datasets.

02. Shows increased robustness at higher noise levels.

03. Provides more reliable clip-level and timestamp predictions.

Abstract

Most sound event detection (SED) systems perform well on clean datasets but degrade significantly in noisy environments. Language-queried audio source separation (LASS) models show promise for robust SED by separating target events; however, existing methods require elaborate multi-stage training and lack explicit guidance for target events. To address these challenges, we introduce event appearance detection (EAD), a counting-based approach that counts event occurrences at both the clip and frame levels. Based on EAD, we propose a co-training-based multi-task learning framework for EAD and SED to enhance SED's performance in noisy environments. First, SED struggles to learn the same patterns as EAD. Then, a task-based constraint is designed to improve prediction consistency between SED and EAD. This framework provides more reliable clip-level predictions for LASS models and strengthens…
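To make the idea of counting at the clip and frame levels more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not define how counts are computed or how the task-based constraint is formulated, so everything below is an assumption for illustration. Specifically, it assumes a clip-level count is the number of onsets per class, a frame-level count is the number of classes active in a frame, and the consistency term is a mean absolute difference. The function names (`clip_level_counts`, `frame_level_counts`, `count_consistency_penalty`) and the toy data are hypothetical.

```python
from typing import Dict, List

# Hypothetical representation: class name -> per-frame binary activity (0 or 1).
Activity = Dict[str, List[int]]


def clip_level_counts(activity: Activity) -> Dict[str, int]:
    """Assumed clip-level count: number of 0 -> 1 onsets per class."""
    counts: Dict[str, int] = {}
    for name, frames in activity.items():
        onsets = 0
        previous = 0
        for current in frames:
            if current == 1 and previous == 0:
                onsets += 1
            previous = current
        counts[name] = onsets
    return counts


def frame_level_counts(activity: Activity) -> List[int]:
    """Assumed frame-level count: number of classes active in each frame."""
    if not activity:
        return []
    return [sum(column) for column in zip(*activity.values())]


def count_consistency_penalty(
    sed_counts: List[int], ead_counts: List[float]
) -> float:
    """Assumed constraint: mean absolute gap between the counts implied by
    SED activity and the counts predicted by a separate EAD output."""
    if len(sed_counts) != len(ead_counts):
        raise ValueError("count sequences must have the same length")
    if not sed_counts:
        return 0.0
    gaps = [abs(s - e) for s, e in zip(sed_counts, ead_counts)]
    return sum(gaps) / len(gaps)


# Toy frame-level SED decisions for a six-frame clip (made-up values).
toy_sed: Activity = {
    "Speech": [0, 1, 1, 0, 1, 0],
    "Dog": [0, 0, 1, 1, 1, 0],
}

# Toy EAD frame-level count predictions (made-up values).
toy_ead_frame_counts = [0, 1, 2, 2, 2, 0]

clip_counts = clip_level_counts(toy_sed)
frame_counts = frame_level_counts(toy_sed)
penalty = count_consistency_penalty(frame_counts, toy_ead_frame_counts)
```

In this toy clip the two tasks disagree in exactly one frame, so the penalty is small but nonzero. In a co-training setup, a term of this kind would presumably be computed on differentiable model outputs rather than hard 0/1 decisions, but the abstract does not specify that detail.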

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis