Detect Any Sound: Open-Vocabulary Sound Event Detection with Multi-Modal Queries

Pengfei Cai; Yan Song; Qing Gu; Nan Jiang; Haoyu Song; Ian McLoughlin

arXiv:2507.16343 · cs.SD · October 28, 2025

Open Access

TL;DR

This paper introduces DASM, a novel query-based framework for open-vocabulary sound event detection that uses multi-modal queries and a dual-stream decoder to improve detection and generalization to unseen classes.

Contribution

DASM formulates SED as a frame-level retrieval task with a dual-stream decoder and an inference-time attention masking strategy, enabling effective open-vocabulary detection and cross-modal feature fusion.

Findings

01. Outperforms CLAP-based methods in the open-vocabulary setting (+7.8 PSDS)

02. Surpasses the baseline in the closed-set setting (+6.9 PSDS)

03. Achieves a PSDS1 score of 42.2 in cross-dataset zero-shot evaluation

Abstract

Most existing sound event detection (SED) algorithms operate under a closed-set assumption, restricting their detection capabilities to predefined classes. While recent efforts have explored language-driven zero-shot SED by exploiting audio-language models, their performance is still far from satisfactory due to the lack of fine-grained alignment and cross-modal feature fusion. In this work, we propose the Detect Any Sound Model (DASM), a query-based framework for open-vocabulary SED guided by multi-modal queries. DASM formulates SED as a frame-level retrieval task, where audio features are matched against query vectors derived from text or audio prompts. To support this formulation, DASM introduces a dual-stream decoder that explicitly decouples event recognition and temporal localization: a cross-modality event decoder performs query-feature fusion and determines the presence of sound…
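
The frame-level retrieval idea in the abstract can be illustrated with a minimal sketch. This is not DASM's dual-stream decoder: it assumes plain cosine similarity between per-frame audio embeddings and query embeddings, followed by a fixed threshold. The function names (`frame_level_retrieval`, `active_segments`), the toy 2-D embeddings, and the threshold of 0.5 are all hypothetical and chosen only for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def frame_level_retrieval(frame_features, query_vectors, threshold=0.5):
    """Score every frame against every query with cosine similarity.

    Assumption: a plain similarity threshold stands in for the learned
    decoder described in the paper. Returns (scores, active), both
    indexed as [query][frame].
    """
    frames = [l2_normalize(f) for f in frame_features]
    queries = [l2_normalize(q) for q in query_vectors]
    scores = [
        [sum(a * b for a, b in zip(q, f)) for f in frames]
        for q in queries
    ]
    active = [[s >= threshold for s in row] for row in scores]
    return scores, active


def active_segments(active_row):
    """Turn a per-frame on/off list into (onset, offset) frame index pairs."""
    segments, start = [], None
    for i, on in enumerate(active_row):
        if on and start is None:
            start = i
        elif not on and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(active_row)))
    return segments


# Toy example: four frames, two queries (e.g. one from text, one from audio).
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
queries = [[1.0, 0.0], [0.0, 1.0]]
scores, active = frame_level_retrieval(frames, queries)
segments = [active_segments(row) for row in active]
```

Because the queries are just vectors, the same matching step applies whether a query embedding came from a text prompt or an audio prompt; only the encoder that produced it would differ.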

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Natural Language Processing Techniques