Exploring Text-Queried Sound Event Detection with Audio Source Separation
Han Yin, Jisheng Bai, Yang Xiao, Hui Wang, Siqi Zheng, Yafeng Chen, Rohan Kumar Das, Chong Deng, Jianfeng Chen

TL;DR
This paper introduces a text-queried sound event detection framework that leverages a pre-trained language-queried source separation model, enhanced with a dual-path RNN, to improve detection accuracy in overlapping sound scenarios.
Contribution
It proposes a novel TQ-SED framework combining language-queried source separation with a dual-path RNN, achieving state-of-the-art results in language-queried audio source separation.
Findings
TQ-SED improves F1 score by 7.22% over conventional methods.
AudioSep-DP achieves first place in DCASE 2024 Task 9.
Enhanced model complexity impacts separation performance.
Abstract
In sound event detection (SED), overlapping sound events pose a significant challenge, as certain events can be easily masked by background noise or other events, resulting in poor detection performance. To address this issue, we propose the text-queried SED (TQ-SED) framework. Specifically, we first pre-train a language-queried audio source separation (LASS) model to separate the audio tracks corresponding to different events from the input audio. Then, multiple target SED branches are employed to detect individual events. AudioSep is a state-of-the-art LASS model, but has limitations in extracting dynamic audio information because of its pure convolutional structure for separation. To address this, we integrate a dual-path recurrent neural network block into the model. We refer to this structure as AudioSep-DP, which achieves the first place in DCASE 2024 Task 9 on language-queried…
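The two-stage flow described in the abstract (separate one track per text query, then run a detection branch on each track) can be sketched in a few lines. This is a minimal toy illustration, not the authors' implementation: `tq_sed`, `frames_to_events`, `toy_separate`, and the per-frame activity values are all hypothetical names and data introduced here, and the separator simply looks up a pre-made track instead of running a real LASS model such as AudioSep-DP.

```python
from typing import Callable, Dict, List, Tuple

Track = List[float]            # toy stand-in: one activity value per frame
Event = Tuple[int, int]        # (onset_frame, offset_frame), offset exclusive


def frames_to_events(activity: Track, threshold: float = 0.5) -> List[Event]:
    """Toy SED branch: threshold each frame, then merge runs of active frames."""
    events: List[Event] = []
    onset = None
    for i, value in enumerate(activity):
        active = value >= threshold
        if active and onset is None:
            onset = i                       # an event starts here
        elif not active and onset is not None:
            events.append((onset, i))       # the event ended on the previous frame
            onset = None
    if onset is not None:                   # event still active at the end
        events.append((onset, len(activity)))
    return events


def tq_sed(
    mixture: Track,
    queries: List[str],
    separate: Callable[[Track, str], Track],
    detect: Callable[[Track], List[Event]],
) -> Dict[str, List[Event]]:
    """Run text-queried separation, then one detection branch per query."""
    results: Dict[str, List[Event]] = {}
    for query in queries:
        track = separate(mixture, query)    # stage 1: text-queried separation
        results[query] = detect(track)      # stage 2: per-event detection
    return results


# Hypothetical toy data: two overlapping events, one value per frame.
stems = {
    "dog barking": [0.0, 0.9, 0.8, 0.0, 0.0, 0.0],
    "speech":      [0.7, 0.7, 0.0, 0.0, 0.6, 0.6],
}
mixture = [a + b for a, b in zip(stems["dog barking"], stems["speech"])]


def toy_separate(mix: Track, query: str) -> Track:
    # Assumption: a perfect separator. A real LASS model would estimate
    # this track from `mix` conditioned on the text query.
    return stems[query]


detections = tq_sed(mixture, list(stems), toy_separate, frames_to_events)
print(detections)
```

The point of the sketch is the data flow: overlap in the mixture (frame 1 contains both events) no longer matters to the detector, because each branch only sees the track extracted for its own query.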
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Advanced Text Analysis Techniques
